Posted to issues@spark.apache.org by "Bago Amirbekian (JIRA)" <ji...@apache.org> on 2017/10/24 23:24:00 UTC

[jira] [Created] (SPARK-22346) Update VectorAssembler to work with StreamingDataframes

Bago Amirbekian created SPARK-22346:
---------------------------------------

             Summary: Update VectorAssembler to work with StreamingDataframes
                 Key: SPARK-22346
                 URL: https://issues.apache.org/jira/browse/SPARK-22346
             Project: Spark
          Issue Type: Improvement
          Components: ML, Structured Streaming
    Affects Versions: 2.2.0
            Reporter: Bago Amirbekian
            Priority: Critical


The issue
In batch mode, VectorAssembler can take multiple columns of VectorType and assemble them into a new output column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines.
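For illustration (this snippet is mine, not from the report; the schema, path, and column names are made up), here is roughly where this surfaces. Applying an assembler to a streaming DataFrame whose vector column carries no size metadata can fail, because no rows can be scanned to infer the size:

  import org.apache.spark.ml.feature.VectorAssembler
  import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
  import org.apache.spark.sql.types._

  // assumes a SparkSession named `spark` is already in scope
  val schema = new StructType()
    .add("age", DoubleType)
    .add("features_a", VectorType)   // vector column with no size metadata

  // Streaming source: rows cannot be inspected eagerly to discover vector sizes.
  val streamDF = spark.readStream.schema(schema).parquet("/tmp/feature-batches")

  val assembler = new VectorAssembler()
    .setInputCols(Array("age", "features_a"))
    .setOutputCol("features")

  // Fine in batch mode (the size can be read from metadata or inferred from the
  // data); on a streaming DataFrame it can fail because the size of "features_a"
  // is unknown and cannot be determined without scanning rows.
  val out = assembler.transform(streamDF)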

I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.

Potential fixes
1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [#13030]. The Estimator can "learn" the sizes of the input vectors during training and save them for use during prediction (a rough usage sketch follows the pros/cons below).
Pros:
* Possibly the simplest of the potential fixes
Cons:
* We'll need to deprecate the current VectorAssembler
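A rough sketch of what option 1 might look like from a user's perspective (class and method names here are hypothetical, not actual Spark API; the real change would presumably follow the OneHotEncoderEstimator pattern from [#13030]):

  // Hypothetical estimator/model pair; names are illustrative only.
  val sizer = new VectorAssemblerEstimator()   // would extend Estimator[VectorAssemblerModel]
    .setInputCols(Array("age", "features_a"))
    .setOutputCol("features")

  // fit() runs on a (batch) training DataFrame and records each input's vector size.
  val model = sizer.fit(trainingDF)

  // transform() can then run on streaming data because the sizes live in the model.
  val scored = model.transform(streamDF)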

2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow VectorAssembler to drop metadata for streaming dataframes (see the sketch after the pros/cons below). Going forward, it would be important not to use any metadata on Vector columns for prediction tasks.
Pros:
* Potentially an easy short-term fix for VectorAssembler
Cons:
* Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
* A partial removal of ML attributes (e.g. ensuring ML attributes are used only during fit, not during transform) might be tricky. It would require tests or some other enforcement mechanism to prevent regressions.
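One way the short-term piece of option 2 could look inside VectorAssembler.transform (a sketch only; buildAttributeGroup is a hypothetical helper standing in for the existing metadata-building logic):

  // Sketch: inside VectorAssembler.transform(dataset)
  val outputMetadata =
    if (dataset.isStreaming) {
      // Streaming: sizes may be unknown and rows cannot be scanned,
      // so emit the output column without ML attribute metadata.
      org.apache.spark.sql.types.Metadata.empty
    } else {
      // Batch: keep today's behavior of building an AttributeGroup from the inputs.
      buildAttributeGroup(dataset).toMetadata()   // hypothetical helper
    }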

3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would deprecate the APIs that allow creating a vector column of unknown length and replace them with equivalents that enforce a fixed size (an illustration follows the pros/cons below).
Pros:
* We already treat vectors as fixed size; for example, VectorAssembler assumes the input & output columns are fixed-size vectors and creates metadata accordingly. In the spirit of explicit is better than implicit, we would be codifying something we already assume.
* This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known.
Cons:
* This would require breaking changes.
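For reference (my own illustration, not from the ticket), this is roughly how a fixed size is already carried on a vector column via ML attributes today; option 3 would make a declared size the required path rather than an optional one. It assumes a DataFrame `df` with a vector column "features_a" of length 10:

  import org.apache.spark.ml.attribute.AttributeGroup

  // Attach an AttributeGroup of known size to the vector column's metadata.
  val meta = new AttributeGroup("features_a", 10).toMetadata()
  val sized = df.withColumn("features_a", df("features_a").as("features_a", meta))

  // Downstream stages (e.g. VectorAssembler) can read the size back
  // from the schema without scanning any rows.
  val size = AttributeGroup.fromStructField(sized.schema("features_a")).size   // 10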





