Posted to reviews@spark.apache.org by MrBago <gi...@git.apache.org> on 2017/12/29 04:30:33 UTC

[GitHub] spark pull request #20112: [SPARK-22734][ML][PySpark] Added Python API for V...

GitHub user MrBago opened a pull request:

    https://github.com/apache/spark/pull/20112

    [SPARK-22734][ML][PySpark] Added Python API for VectorSizeHint.

    (Please fill in changes proposed in this fix)
    
    Python API for VectorSizeHint Transformer.
    
    (Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
    
    doc-tests.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/MrBago/spark vectorSizeHint-PythonAPI

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/20112.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #20112
    
----
commit 83bb7ded0d58d4173671904a452039b57bcbea3d
Author: Bago Amirbekian <ba...@...>
Date:   2017-12-29T03:05:53Z

    Added Python API for VectorSizeHint.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Looks good except for the style issue


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    **[Test build #85537 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85537/testReport)** for PR 20112 at commit [`de9ea9f`](https://github.com/apache/spark/commit/de9ea9f8bca13771ebd6df1f1a70939fce59a88a).


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    **[Test build #85527 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85527/testReport)** for PR 20112 at commit [`9a1ee2b`](https://github.com/apache/spark/commit/9a1ee2bf2b1c3efa6f666ce6f147fd089d32541d).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Merged build finished. Test PASSed.


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85496/
    Test PASSed.


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    **[Test build #85496 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85496/testReport)** for PR 20112 at commit [`83bb7de`](https://github.com/apache/spark/commit/83bb7ded0d58d4173671904a452039b57bcbea3d).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReadable,`


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    **[Test build #85496 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85496/testReport)** for PR 20112 at commit [`83bb7de`](https://github.com/apache/spark/commit/83bb7ded0d58d4173671904a452039b57bcbea3d).


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    LGTM
    merging with master
    Thank you!


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    **[Test build #85537 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85537/testReport)** for PR 20112 at commit [`de9ea9f`](https://github.com/apache/spark/commit/de9ea9f8bca13771ebd6df1f1a70939fce59a88a).
     * This patch passes all tests.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark pull request #20112: [SPARK-22734][ML][PySpark] Added Python API for V...

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/spark/pull/20112


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Test PASSed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85537/
    Test PASSed.


---



[GitHub] spark pull request #20112: [SPARK-22734][ML][PySpark] Added Python API for V...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20112#discussion_r159097278
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -3466,6 +3466,72 @@ def selectedFeatures(self):
             return self._call_java("selectedFeatures")
     
     
    +@inherit_doc
    +class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReadable,
    --- End diff --
    
    You'll need to override handleInvalid, like in the Scala API, since it takes different values & has a different docstring.
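
As a rough illustration of the override requested above, the sketch below uses simplified stand-in classes, not PySpark's real `Param` or `HasHandleInvalid`. The names `Param`, `HasHandleInvalid`, and `VectorSizeHintSketch` and the wording of both doc strings are hypothetical; the `optimistic` option is assumed from the Scala `VectorSizeHint` transformer.

```python
class Param:
    """Minimal stand-in for a parameter descriptor: a name plus a doc string."""

    def __init__(self, name, doc):
        self.name = name
        self.doc = doc


class HasHandleInvalid:
    """Stand-in for a shared mixin that declares a generic handleInvalid Param."""

    handleInvalid = Param(
        "handleInvalid",
        "How to handle invalid entries. Options: skip, error.")


class VectorSizeHintSketch(HasHandleInvalid):
    """Redefines handleInvalid so this class carries its own options and doc string."""

    # Shadowing the class attribute replaces the mixin's Param for this class only;
    # the mixin and any other classes that use it are left unchanged.
    handleInvalid = Param(
        "handleInvalid",
        "How to handle invalid vectors in inputCol. "
        "Options: skip, error, optimistic.")
```

The point of the pattern is that the subclass keeps the same Param name, so callers still pass `handleInvalid=...`, while the documented set of accepted values differs from the shared default.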


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    **[Test build #85527 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85527/testReport)** for PR 20112 at commit [`9a1ee2b`](https://github.com/apache/spark/commit/9a1ee2bf2b1c3efa6f666ce6f147fd089d32541d).


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85527/
    Test FAILed.


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    **[Test build #85528 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85528/testReport)** for PR 20112 at commit [`1ec7a41`](https://github.com/apache/spark/commit/1ec7a4161114d4e488f221c24c1c20f7f6917cf6).
     * This patch **fails Python style tests**.
     * This patch merges cleanly.
     * This patch adds no public classes.


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Merged build finished. Test FAILed.


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Merged build finished. Test PASSed.


---



[GitHub] spark pull request #20112: [SPARK-22734][ML][PySpark] Added Python API for V...

Posted by jkbradley <gi...@git.apache.org>.
Github user jkbradley commented on a diff in the pull request:

    https://github.com/apache/spark/pull/20112#discussion_r159096655
  
    --- Diff: python/pyspark/ml/feature.py ---
    @@ -3466,6 +3466,72 @@ def selectedFeatures(self):
             return self._call_java("selectedFeatures")
     
     
    +@inherit_doc
    +class VectorSizeHint(JavaTransformer, HasInputCol, HasHandleInvalid, JavaMLReadable,
    +                     JavaMLWritable):
    +    """
    +    A feature transformer that adds size information to the metadata of a vector column.
    +    VectorAssembler needs size information for its input columns and cannot be used on streaming
    +    dataframes without this metadata.
    +
    +    >>> from pyspark.ml.linalg import Vectors
    +    >>> from pyspark.ml import Pipeline, PipelineModel
    +    >>> data = [(Vectors.dense([1., 2., 3.]), 4.)]
    +    >>> df = spark.createDataFrame(data, ["vector", "float"])
    +    >>>
    +    >>> sizeHint = VectorSizeHint(inputCol="vector", size=3, handleInvalid="skip")
    +    >>> vecAssembler = VectorAssembler(inputCols=["vector", "float"], outputCol="assembled")
    +    >>> pipeline = Pipeline(stages=[sizeHint, vecAssembler])
    +    >>>
    +    >>> pipelineModel = pipeline.fit(df)
    +    >>> pipelineModel.transform(df).head().assembled
    +    DenseVector([1.0, 2.0, 3.0, 4.0])
    +    >>> vectorSizeHintPath = temp_path + "/vector-size-hint-pipeline"
    +    >>> pipelineModel.save(vectorSizeHintPath)
    +    >>> loadedPipeline = PipelineModel.load(vectorSizeHintPath)
    +    >>> loaded = loadedPipeline.transform(df).head().assembled
    +    >>> expected = pipelineModel.transform(df).head().assembled
    +    >>> loaded == expected
    +    True
    +
    +    .. versionadded:: 2.3.0
    +    .. note:: Experimental
    +    """
    +
    +    size = Param(Params._dummy(), "size", "Size of vectors in column.",
    +                 typeConverter=TypeConverters.toInt)
    +
    +    @since("2.3.0")
    +    def getSize(self):
    +        """ Gets size param, the size of vectors in `inputCol`."""
    +        return self.getOrDefault(self.size)
    +
    +    @since("2.3.0")
    +    def setSize(self, value):
    +        """ Sets size param, the size of vectors in `inputCol`."""
    +        return self._set(size=value)
    +
    +    @keyword_only
    +    def __init__(self, inputCol=None, size=None, handleInvalid="error"):
    --- End diff --
    
    Let's stick with the order that all other Python classes follow: dummy Params, __init__, then Param setters & getters.
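
The requested layout can be sketched with a small stand-in class. This is an illustrative outline only: `SizedSketch` and its dictionary-backed `_paramMap` are hypothetical and do not use PySpark's `Params` machinery.

```python
class SizedSketch:
    """Hypothetical class laid out in the order: Params, __init__, setters and getters."""

    # 1. Dummy Params: class-level declarations come first.
    #    (A plain string key stands in for a real Param object here.)
    size = "size"

    # 2. __init__ comes next and applies any constructor arguments.
    def __init__(self, size=None):
        self._paramMap = {}
        if size is not None:
            self.setSize(size)

    # 3. Param setters and getters come last.
    def setSize(self, value):
        """Sets size and returns self so that calls can be chained."""
        self._paramMap[self.size] = int(value)
        return self

    def getSize(self):
        """Gets size; the explicit return is what makes the getter useful."""
        return self._paramMap[self.size]
```

A short usage example: `SizedSketch(size=3).setSize(5).getSize()` evaluates to `5`, which relies on the setter returning `self` and the getter returning the stored value.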


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    **[Test build #85528 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/85528/testReport)** for PR 20112 at commit [`1ec7a41`](https://github.com/apache/spark/commit/1ec7a4161114d4e488f221c24c1c20f7f6917cf6).


---



[GitHub] spark issue #20112: [SPARK-22734][ML][PySpark] Added Python API for VectorSi...

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/20112
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/85528/
    Test FAILed.


---
