Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2017/10/25 22:42:00 UTC

[jira] [Commented] (SPARK-22346) Update VectorAssembler to work with Structured Streaming

    [ https://issues.apache.org/jira/browse/SPARK-22346?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16219686#comment-16219686 ] 

Joseph K. Bradley commented on SPARK-22346:
-------------------------------------------

My thoughts on the options.  Summary: I don't have a strong preference between Options 2 & 3, since I think either could be done without breaking changes.

* Option 1 (VectorAssembler as an Estimator): too drastic
** This would break almost every MLlib workflow I've seen.

* Option 2 (drop metadata when unavailable): good if we're careful
** I think we can avoid breaking changes here.
** We can drop metadata only when (a) part of the metadata is unavailable and (b) the DataFrame is a streaming DataFrame.  That way, we won't change existing MLlib workflows, and we will enable new ones using streaming.  We can also log a warning about metadata being dropped; a sketch of this check follows below.
** Long-term, we can improve these streaming workflows by maintaining partial metadata.
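
A minimal sketch of that check (the helper name and shape are mine, not a proposed API), written against the existing AttributeGroup API:

    import org.apache.spark.ml.attribute.AttributeGroup
    import org.apache.spark.sql.DataFrame

    // Hypothetical helper: drop metadata only when (a) some input
    // column's vector size is unknown and (b) the DataFrame is
    // streaming, so existing batch workflows are untouched.
    // Assumes every column in inputCols is vector-typed.
    def shouldDropMetadata(df: DataFrame, inputCols: Seq[String]): Boolean = {
      val anySizeUnknown = inputCols.exists { c =>
        // AttributeGroup.size returns -1 when the vector length is unknown
        AttributeGroup.fromStructField(df.schema(c)).size < 0
      }
      anySizeUnknown && df.isStreaming
    }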

* Option 3 (fixed-length vectors): good if we're careful
** I think we can avoid breaking changes here.
** Note that fixed-length vectors are sort-of required already since ML attributes for Vector columns assume fixed lengths.  I've also never heard of the need for variable lengths.
** We can provide a method (or better, a Transformer) which adds metadata to a column; see the sketch after this list.  (It could just be for specifying vector length for now, not a general metadata utility.)
** Current MLlib workflows should not require this; they are either batch or they already have metadata.
** New MLlib workflows using Streaming without metadata will be enabled when users add this Transformer to their workflow.
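
A sketch of that metadata-adding utility (hypothetical names; a real version would presumably be a Transformer with params so it can sit in a Pipeline):

    import org.apache.spark.ml.attribute.AttributeGroup
    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.col

    // Hypothetical utility: stamp a known vector size onto a column as
    // an AttributeGroup, so downstream stages never have to inspect
    // rows to discover the vector length.
    def addVectorSize(df: DataFrame, colName: String, size: Int): DataFrame = {
      val meta = new AttributeGroup(colName, size).toMetadata()
      df.withColumn(colName, col(colName).as(colName, meta))
    }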

Long-term, it'd be great to support a sparse metadata format. Assuming we want to keep metadata around (and I think we should, since it's really useful, e.g., for providing parity with R models by tracking feature names), this seems like the best option for fixing these several issues around metadata.

What do you think?

> Update VectorAssembler to work with Structured Streaming
> --------------------------------------------------------
>
>                 Key: SPARK-22346
>                 URL: https://issues.apache.org/jira/browse/SPARK-22346
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, Structured Streaming
>    Affects Versions: 2.2.0
>            Reporter: Bago Amirbekian
>            Priority: Critical
>
> The issue
> In batch mode, VectorAssembler can take multiple columns of VectorType and assemble them into a single output column of VectorType containing the concatenated vectors. In streaming mode, this transformation can fail because VectorAssembler does not have enough information to produce metadata (AttributeGroup) for the new column. Because VectorAssembler is such a ubiquitous part of MLlib pipelines, this issue effectively means Spark Structured Streaming does not support prediction using MLlib pipelines.
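> To make the failure concrete, here is a sketch (not the actual VectorAssembler code) of the metadata lookup involved, given a DataFrame df with a vector column "features"; the batch fallback of inspecting a row via first() is exactly what a streaming DataFrame cannot do:
>
>     import org.apache.spark.ml.attribute.AttributeGroup
>     import org.apache.spark.ml.linalg.Vector
>
>     // Read the vector size recorded in the column's metadata, if any;
>     // AttributeGroup.size is -1 when the length is unknown.
>     val group = AttributeGroup.fromStructField(df.schema("features"))
>     val sizeFromMetadata: Option[Int] =
>       if (group.size >= 0) Some(group.size) else None
>
>     // Batch-only fallback when metadata is missing: inspect one row.
>     // first() is unsupported on streaming DataFrames, hence the failure.
>     val size = sizeFromMetadata.getOrElse(
>       df.select("features").first().getAs[Vector](0).size)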
> I've created this ticket so we can discuss ways to potentially improve VectorAssembler. Please let me know if there are any issues I have not considered or potential fixes I haven't outlined. I'm happy to submit a patch once I know which strategy is the best approach.
> Potential fixes
> 1) Replace VectorAssembler with an estimator/model pair, as was recently done with OneHotEncoder, [SPARK-13030|https://issues.apache.org/jira/browse/SPARK-13030]. The Estimator can "learn" the sizes of the input vectors during training and save them for use during prediction (a sketch of the fit step follows below).
> Pros:
> * Possibly simplest of the potential fixes
> Cons:
> * We'll need to deprecate the current VectorAssembler
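> A sketch of the fit step this implies (illustrative names only, not a proposed API): learn each input column's vector size from the batch training data; the resulting Model would store these sizes so transform() never inspects rows at prediction time:
>
>     import org.apache.spark.ml.linalg.Vector
>     import org.apache.spark.sql.DataFrame
>     import org.apache.spark.sql.functions.col
>
>     // Fit-time size learning (batch only, so first() is safe here).
>     // Assumes every column in inputCols is vector-typed.
>     def learnVectorSizes(train: DataFrame, inputCols: Seq[String]): Map[String, Int] = {
>       val row = train.select(inputCols.map(col): _*).first()
>       inputCols.zipWithIndex.map { case (c, i) =>
>         c -> row.getAs[Vector](i).size
>       }.toMap
>     }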
> 2) Drop the metadata (ML Attributes) from Vector columns. This is a pretty major change, but it could be done in stages. We could first ensure that metadata is not used during prediction and allow VectorAssembler to drop metadata for streaming DataFrames. Going forward, it would be important to not use any metadata on Vector columns for any prediction tasks.
> Pros:
> * Potentially an easy short-term fix for VectorAssembler (drop metadata for vector columns in streaming).
> * The current Attributes implementation is also causing other issues, e.g. [SPARK-19141|https://issues.apache.org/jira/browse/SPARK-19141].
> Cons:
> * Fully removing ML Attributes would be a major refactor of MLlib and would most likely require breaking changes.
> * A partial removal of ML attributes (e.g., ensuring ML attributes are used only during fit, not during transform) might be tricky. This would require tests or some other enforcement mechanism to prevent regressions.
> 3) Require Vector columns to have fixed-length vectors. Most MLlib transformers that produce vectors already include the size of the vector in the column metadata. This change would deprecate APIs that allow creating a vector column of unknown length and replace them with equivalents that enforce a fixed size.
> Pros:
> * We already treat vectors as fixed size; for example, VectorAssembler assumes the input and output columns are fixed-size vectors and creates metadata accordingly. In the spirit of "explicit is better than implicit", we would be codifying something we already assume.
> * This could potentially enable performance optimizations that are only possible if the Vector size of a column is fixed & known.
> Cons:
> * This would require breaking changes.


