Posted to reviews@spark.apache.org by tengpeng <gi...@git.apache.org> on 2018/06/09 20:24:48 UTC

[GitHub] spark pull request #21522: [SPARK-24467][ML] VectorAssemblerEstimator

GitHub user tengpeng opened a pull request:

    https://github.com/apache/spark/pull/21522

    [SPARK-24467][ML] VectorAssemblerEstimator

    Background: See the JIRA ticket.
    
    This PR is at a very early stage; hopefully it will help us decide on the right direction.
    
    ## What changes were proposed in this pull request? 
    
    1. Add an optional Param to VectorAssembler for specifying the sizes of the Vectors in the inputCols.
    - If not given, VectorAssembler behaves as it does now.
    - If given, VectorAssembler can use that information instead of determining the Vector sizes from metadata or by examining Rows in the data. It also performs consistency checks.
    2. Add a VectorAssemblerEstimator that gets the Vector lengths from the data and produces a VectorAssembler_Model_ with the vector lengths Param specified.
    
    Todos:
    1. Reduce code duplication. Not sure whether we want a trait, like `OneHotEncoderBase`, to reduce duplication between `VectorAssembler` and `VectorAssemblerEstimator`.
    2. Add comments, documentation, etc.
    
    
    ## How was this patch tested?
    Added unit tests.
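
The two proposed changes above (an optional sizes Param, and an estimator that learns the sizes from data) can be illustrated with a minimal, Spark-free sketch. It is plain Python rather than the Scala used in this PR, and the names `infer_vector_sizes`, `assemble`, and `vector_sizes` are hypothetical, chosen for illustration only; they are not part of the patch or of the Spark API.

```python
from typing import Dict, List, Optional, Sequence, Union

# A row maps column names to either a scalar or a vector (sequence of floats).
Value = Union[int, float, Sequence[float]]
Row = Dict[str, Value]


def infer_vector_sizes(rows: List[Row], vector_cols: List[str]) -> Dict[str, int]:
    """Estimator-like step (hypothetical): learn each vector column's length
    from the first row that holds a non-null value for that column."""
    sizes: Dict[str, int] = {}
    for col in vector_cols:
        for row in rows:
            value = row.get(col)
            if value is not None:
                sizes[col] = len(value)
                break
        else:
            raise ValueError(f"cannot infer size of column {col!r}: no non-null value")
    return sizes


def assemble(
    row: Row,
    input_cols: List[str],
    vector_sizes: Optional[Dict[str, int]] = None,
) -> List[float]:
    """Transformer-like step (hypothetical): concatenate scalars and vectors
    into one flat list. When vector_sizes is given, each listed column is
    checked against its declared length (the consistency check)."""
    out: List[float] = []
    for col in input_cols:
        value = row[col]
        if isinstance(value, (int, float)):
            out.append(float(value))
            continue
        if vector_sizes is not None and col in vector_sizes:
            expected = vector_sizes[col]
            if len(value) != expected:
                raise ValueError(
                    f"column {col!r} has length {len(value)}, expected {expected}"
                )
        out.extend(float(x) for x in value)
    return out


rows = [
    {"age": 30.0, "emb": [0.1, 0.2, 0.3]},
    {"age": 41.0, "emb": [0.4, 0.5, 0.6]},
]
sizes = infer_vector_sizes(rows, ["emb"])
assembled = [assemble(r, ["age", "emb"], sizes) for r in rows]
```

With the sizes fixed up front, the length of the assembled vector is known without inspecting any further rows, and a row whose vector has an unexpected length fails fast instead of silently producing a vector of the wrong size.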


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tengpeng/spark Spark-24467

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/21522.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #21522
    
----
commit 8e3aa44c3937d60d5aa35dd03604e57ef218ebb4
Author: Teng Peng <jo...@...>
Date:   2018-06-09T12:48:30Z

    Add a param to VectorAssembler for specifying the sizes of Vectors. Add a VectorAssemblerEstimator.

----


---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21522
  
    Can one of the admins verify this patch?


---



[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21522
  
    **[Test build #96700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96700/testReport)** for PR 21522 at commit [`8e3aa44`](https://github.com/apache/spark/commit/8e3aa44c3937d60d5aa35dd03604e57ef218ebb4).


---





[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21522
  
    Jenkins ok to test.


---



[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21522
  
    cc @jkbradley again


---



[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator

Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:

    https://github.com/apache/spark/pull/21522
  
    cc @jkbradley as the reporter of this issue you might want to take a look.


---



[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator

Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:

    https://github.com/apache/spark/pull/21522
  
    **[Test build #96700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96700/testReport)** for PR 21522 at commit [`8e3aa44`](https://github.com/apache/spark/commit/8e3aa44c3937d60d5aa35dd03604e57ef218ebb4).
     * This patch **fails Java style tests**.
     * This patch **does not merge cleanly**.
     * This patch adds the following public classes _(experimental)_:
      * `class VectorAssemblerEstimator(override val uid: String)`
      * `class VectorAssemblerModel(override val uid: String, val vectorColsLengths: Map[String, Int])`
      * `  class VectorAssemblerModelWriter(instance: VectorAssemblerModel) extends MLWriter `
      * `  class VectorAssemblerModelReader extends MLReader[VectorAssemblerModel] `


---



[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21522
  
    Test FAILed.
    Refer to this link for build results (access rights to CI server needed): 
    https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96700/
    Test FAILed.


---



[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator

Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:

    https://github.com/apache/spark/pull/21522
  
    Build finished. Test FAILed.


---


