Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/07/17 01:36:04 UTC

[jira] [Commented] (SPARK-9084) Add in support for realtime data predictions using ML PipelineModel

    [ https://issues.apache.org/jira/browse/SPARK-9084?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14630529#comment-14630529 ] 

Joseph K. Bradley commented on SPARK-9084:
------------------------------------------

This has definitely been discussed, but it is not on the roadmap right now. I don't think MLlib should add its own concept of a row-with-schema; that needs to be added to Spark SQL / DataFrames. The current plan is to support this type of prediction via:
* Spark Streaming DataFrames: This is on the roadmap but probably will not happen in 1.5. It can be hacked together today (a rough sketch follows this list), but it should ideally get official support in the future.
* ML model export: We are increasing support for exporting models via PMML, after which the models can be imported into other tools for scoring.
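
For reference, here is a rough sketch of the "hack it together" approach on current (1.4-era) APIs: convert each streaming micro-batch into a DataFrame and score it with a fitted PipelineModel. The socket source, the "text" input column, the "prediction" output column, and the sc / pipelineModel values are illustrative assumptions, not anything specified by this ticket.

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.SQLContext
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Assumes an existing SparkContext `sc` and an already-fitted PipelineModel
    // `pipelineModel` whose last stage appends a "prediction" column.
    val ssc = new StreamingContext(sc, Seconds(2))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Arbitrary text source; any DStream would do.
    val lines = ssc.socketTextStream("localhost", 9999)

    lines.foreachRDD { rdd =>
      if (!rdd.isEmpty()) {
        // Turn the micro-batch into a DataFrame with the column the pipeline
        // expects, then run the fitted pipeline over it.
        val batch = rdd.map(Tuple1(_)).toDF("text")
        val scored = pipelineModel.transform(batch)
        scored.select("text", "prediction").show()
      }
    }

    ssc.start()
    ssc.awaitTermination()

This keeps everything inside the existing DataFrame-based transform API, but per-record latency is bounded by the streaming batch interval, which is part of the gap this ticket points at.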

I'm going to close this for now, but if you are passionate about this issue, you should re-open it against Spark SQL / DataFrames (though there may already be JIRAs for rows with schema, so please search first).

> Add in support for realtime data predictions using ML PipelineModel
> -------------------------------------------------------------------
>
>                 Key: SPARK-9084
>                 URL: https://issues.apache.org/jira/browse/SPARK-9084
>             Project: Spark
>          Issue Type: New Feature
>          Components: ML
>            Reporter: Hollin Wilkins
>            Priority: Minor
>
> Currently ML provides excellent support for feature manipulation, model selection, and prediction for large datasets. The models can all be easily serialized, but it is not currently possible to use the fitted models without a DataFrame. This means that these models are only good for batch processing. In order to support realtime ML pipelines, I propose adding three new methods to the Transformer class:
> def transform(row: StructuredRow): StructuredRow
> def transform(row: StructuredRow, paramMap: ParamMap): StructuredRow
> def transform(row: StructuredRow, firstParamPair: ParamPair[_], otherParamPairs: ParamPair[_]*): StructuredRow
> Here, StructuredRow is a case class combining an org.apache.spark.sql.Row with an org.apache.spark.sql.types.StructType. An alternative would be to change the transform method signature to take two objects, a StructType and a Row.
> This change necessitates the addition of the new transform method to each implementor of the Transformer class.
> Following this change, it would be trivial to include the Spark jars in an API server, deserialize an ML PipelineModel object, take incoming data from users, convert it into a StructuredRow, and feed it into the PipelineModel to get a realtime result.
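
To make the quoted proposal concrete: the StructuredRow type it describes does not exist in Spark, so the case class below is only a sketch of the ticket's idea, and the workaround that follows shows how a single record has to be scored today, by wrapping it in a one-row DataFrame. The "text" column, the numeric "prediction" output column, and the sc / sqlContext / pipelineModel values are illustrative assumptions.

    import org.apache.spark.ml.PipelineModel
    import org.apache.spark.sql.{Row, SQLContext}
    import org.apache.spark.sql.types.{StringType, StructField, StructType}

    // Hypothetical type from the proposal: a Row paired with its schema.
    // Not part of any Spark release; sketched here only for illustration.
    case class StructuredRow(schema: StructType, row: Row)

    // What works today: wrap one incoming record in a one-row DataFrame and
    // push it through the fitted pipeline. Assumes an existing SparkContext
    // `sc` and a fitted `pipelineModel` that reads a "text" column and
    // appends a numeric "prediction" column.
    val sqlContext = new SQLContext(sc)
    val schema = StructType(Seq(StructField("text", StringType, nullable = false)))
    val singleRecord = sqlContext.createDataFrame(
      sc.parallelize(Seq(Row("incoming request text"))), schema)
    val prediction = pipelineModel.transform(singleRecord)
      .select("prediction")
      .head()
      .getDouble(0)

The detour through an RDD and a DataFrame for every single record is what makes this unattractive for low-latency serving, which is the gap the proposed transform(row: StructuredRow) overloads are meant to close.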



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org