Posted to issues@spark.apache.org by "Yanbo Liang (JIRA)" <ji...@apache.org> on 2016/04/06 08:35:25 UTC

[jira] [Comment Edited] (SPARK-14311) Model persistence in SparkR

    [ https://issues.apache.org/jira/browse/SPARK-14311?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15227811#comment-15227811 ] 

Yanbo Liang edited comment on SPARK-14311 at 4/6/16 6:34 AM:
-------------------------------------------------------------

While working on SPARK-14313, I found that ml.save is easy to support but ml.load is difficult to support in the current framework.
This is because ml.load must return a single class type rather than different types for different models. We could make ml.load return an S4 class "PipelineModel", but that does not fit the new SparkRWrapper framework, which wraps each MLlib model in a different S4 class.
I also found that the R h2o package uses a unified H2OModel object for all models, which makes save/load easier to support. http://finzi.psych.upenn.edu/library/h2o/html/h2o.loadModel.html
Should we make ml.load return a unified S4 class "PipelineModel"? If so, we should first refactor the model wrappers so they are all defined as the unified S4 class "PipelineModel". But that looks like falling back to the old SparkRWrapper framework.
I'm looking forward to hearing your thoughts. [~mengxr]
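As a rough sketch of the unified-class idea (the class, slot, and function names below are mine for illustration, not actual SparkR code), every model wrapper would share one S4 class, with the concrete model kind recorded in a slot instead of in the R class hierarchy, so ml.load always has a single return type:

```r
library(methods)

# Hypothetical unified wrapper: one S4 class for all MLlib models.
# The real SparkR wrapper would hold a JVM object reference in `backend`.
setClass("PipelineModel",
         representation(modelType = "character",
                        backend   = "ANY"))

# Sketch of a loader that always returns the same class. In real code the
# model type would be read from metadata saved alongside the model at
# `path`; here `savedType` stands in for that metadata.
ml.load.sketch <- function(path, savedType) {
  new("PipelineModel", modelType = savedType, backend = NULL)
}

m <- ml.load.sketch("/tmp/model", "NaiveBayesModel")
```

The cost, as noted above, is that per-model S4 classes go away, so any model-specific methods would have to branch on the modelType slot rather than use S4 dispatch.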



> Model persistence in SparkR
> ---------------------------
>
>                 Key: SPARK-14311
>                 URL: https://issues.apache.org/jira/browse/SPARK-14311
>             Project: Spark
>          Issue Type: Umbrella
>          Components: ML, SparkR
>            Reporter: Xiangrui Meng
>            Assignee: Xiangrui Meng
>
> In Spark 2.0, we are going to have 4 ML models in SparkR: GLMs, k-means, naive Bayes, and AFT survival regression. Users can fit models, get summary, and make predictions. However, they cannot save/load the models yet.
> ML models in SparkR are wrappers around ML pipelines. So it should be straightforward to implement model persistence. We need to think more about the API. R uses save/load for objects and datasets (also objects). It is possible to overload save for ML models, e.g., save.NaiveBayesWrapper. But I'm not sure whether load can be overloaded easily. I propose the following API:
> {code}
> model <- glm(formula, data = df)
> ml.save(model, path, mode = "overwrite")
> model2 <- ml.load(path)
> {code}
> We defined wrappers as S4 classes. So `ml.save` is an S4 method and `ml.load` is an S3 method (correct me if I'm wrong).
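
To illustrate the dispatch asymmetry behind that question (class and function names below are hypothetical, not actual SparkR internals): ml.save can be an S4 generic because it dispatches on the model object's class, while ml.load only receives a character path, so there is nothing to dispatch on and it has to be a plain function that discovers the model type from saved metadata:

```r
library(methods)

# ml.save dispatches on the wrapper's S4 class.
setGeneric("ml.save", function(object, path, ...) standardGeneric("ml.save"))

# One hypothetical per-model wrapper class, standing in for the real ones.
setClass("NaiveBayesWrapper", representation(jobj = "ANY"))

setMethod("ml.save", signature(object = "NaiveBayesWrapper"),
          function(object, path, ...) {
            # Real code would call the JVM-side model writer here.
            invisible(path)
          })

# ml.load cannot dispatch: `path` is just a string. It must read the saved
# metadata first and only then construct the right wrapper (placeholder
# reconstruction shown here).
ml.load <- function(path) {
  new("NaiveBayesWrapper", jobj = NULL)
}
```

This is why a unified return class (or a type tag read from metadata) keeps coming up in the discussion above: the load side fundamentally cannot use class-based dispatch.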



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org