Posted to reviews@spark.apache.org by tengpeng <gi...@git.apache.org> on 2018/06/09 20:24:48 UTC
[GitHub] spark pull request #21522: [SPARK-24467][ML] VectorAssemblerEstimator
GitHub user tengpeng opened a pull request:
https://github.com/apache/spark/pull/21522
[SPARK-24467][ML] VectorAssemblerEstimator
Background: See the JIRA ticket.
This PR is at a very early stage; hopefully it will help us decide on the right direction.
## What changes were proposed in this pull request?
1. Add an optional Param to VectorAssembler for specifying the sizes of Vectors in the inputCols.
- If not given, then VectorAssembler will behave as it does now.
- If given, then VectorAssembler can use that information instead of determining the Vector sizes from metadata or by examining Rows in the data. It also performs consistency checks.
2. Add a VectorAssemblerEstimator, which gets the Vector lengths from the data and produces a `VectorAssemblerModel` with the vector lengths Param specified.
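To make the proposed split concrete, here is a minimal plain-Scala sketch of the idea. It is not the code in this PR and does not use Spark's API: `VectorAssemblerSketch`, `AssemblerModel`, `fit`, and `Row` are hypothetical names, every input column is modelled as an `Array[Double]`, and the "estimator" is assumed to read the lengths from the first row only.

```scala
// Plain-Scala illustration only; none of these names come from Spark or from this PR.
object VectorAssemblerSketch {
  // Hypothetical row type: column name -> vector values.
  type Row = Map[String, Array[Double]]

  // "Model" side: the per-column lengths are already known, so the output
  // size is available without looking at any data, and each row can be checked.
  final case class AssemblerModel(inputCols: Seq[String], lengths: Map[String, Int]) {
    val outputSize: Int = inputCols.map(lengths).sum

    def assemble(row: Row): Array[Double] =
      inputCols.flatMap { col =>
        val v = row(col)
        // Consistency check against the specified length.
        require(v.length == lengths(col),
          s"Column $col: expected length ${lengths(col)}, got ${v.length}")
        v.toSeq
      }.toArray
  }

  // "Estimator" side: infer the lengths from the data (assumption: first row only)
  // and return a model with those lengths fixed.
  def fit(inputCols: Seq[String], data: Seq[Row]): AssemblerModel = {
    require(data.nonEmpty, "cannot infer vector lengths from empty data")
    val first = data.head
    AssemblerModel(inputCols, inputCols.map(c => c -> first(c).length).toMap)
  }
}
```

With `fit(Seq("a", "b"), data)` on rows where `a` has two elements and `b` has one, the resulting model reports an `outputSize` of 3 and rejects any later row whose `a` or `b` has a different length.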
Todos:
1. Reduce code duplication. Not sure whether we want a trait that reduces duplication between `VectorAssembler` and `VectorAssemblerEstimator`, like `OneHotEncoderBase`.
2. Add comments, documentation, etc.
## How was this patch tested?
Added unit tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/tengpeng/spark Spark-24467
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21522.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21522
----
commit 8e3aa44c3937d60d5aa35dd03604e57ef218ebb4
Author: Teng Peng <jo...@...>
Date: 2018-06-09T12:48:30Z
Add a param to VectorAssembler for specifying the sizes of Vectors. Add a VectorAssemblerEstimator.
----
---
---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21522
Can one of the admins verify this patch?
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21522
**[Test build #96700 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96700/testReport)** for PR 21522 at commit [`8e3aa44`](https://github.com/apache/spark/commit/8e3aa44c3937d60d5aa35dd03604e57ef218ebb4).
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21522
Can one of the admins verify this patch?
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21522
Can one of the admins verify this patch?
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21522
Jenkins ok to test.
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21522
cc @jkbradley again
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by holdenk <gi...@git.apache.org>.
Github user holdenk commented on the issue:
https://github.com/apache/spark/pull/21522
cc @jkbradley as the reporter of this issue you might want to take a look.
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by SparkQA <gi...@git.apache.org>.
Github user SparkQA commented on the issue:
https://github.com/apache/spark/pull/21522
**[Test build #96700 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/96700/testReport)** for PR 21522 at commit [`8e3aa44`](https://github.com/apache/spark/commit/8e3aa44c3937d60d5aa35dd03604e57ef218ebb4).
* This patch **fails Java style tests**.
* This patch **does not merge cleanly**.
* This patch adds the following public classes _(experimental)_:
  * `class VectorAssemblerEstimator(override val uid: String)`
  * `class VectorAssemblerModel(override val uid: String, val vectorColsLengths: Map[String, Int])`
  * `class VectorAssemblerModelWriter(instance: VectorAssemblerModel) extends MLWriter`
  * `class VectorAssemblerModelReader extends MLReader[VectorAssemblerModel]`
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21522
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/96700/
Test FAILed.
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21522
Build finished. Test FAILed.
---
[GitHub] spark issue #21522: [SPARK-24467][ML] VectorAssemblerEstimator
Posted by AmplabJenkins <gi...@git.apache.org>.
Github user AmplabJenkins commented on the issue:
https://github.com/apache/spark/pull/21522
Can one of the admins verify this patch?
---