Posted to issues@spark.apache.org by "Dongjoon Hyun (Jira)" <ji...@apache.org> on 2020/03/16 22:52:05 UTC

[jira] [Updated] (SPARK-24359) SPIP: ML Pipelines in R

     [ https://issues.apache.org/jira/browse/SPARK-24359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Dongjoon Hyun updated SPARK-24359:
----------------------------------
    Affects Version/s:     (was: 3.0.0)
                       3.1.0

> SPIP: ML Pipelines in R
> -----------------------
>
>                 Key: SPARK-24359
>                 URL: https://issues.apache.org/jira/browse/SPARK-24359
>             Project: Spark
>          Issue Type: Improvement
>          Components: SparkR
>    Affects Versions: 3.1.0
>            Reporter: Hossein Falaki
>            Priority: Major
>              Labels: SPIP
>         Attachments: SparkML_ ML Pipelines in R-v2.pdf, SparkML_ ML Pipelines in R-v3.pdf, SparkML_ ML Pipelines in R.pdf
>
>
> h1. Background and motivation
> SparkR supports calling MLlib functionality with an [R-friendly API|https://docs.google.com/document/d/10NZNSEurN2EdWM31uFYsgayIPfCFHiuIu3pCWrUmP_c/]. Since Spark 1.5, the (new) SparkML API, which is based on [pipelines and parameters|https://docs.google.com/document/d/1rVwXRjWKfIb-7PI6b86ipytwbUH7irSNLF1_6dLmh8o], has matured significantly. It allows users to build and maintain complicated machine learning pipelines. A lot of this functionality is difficult to expose through SparkR's simple formula-based API.
> We propose a new R package, _SparkML_, to be distributed along with SparkR as part of Apache Spark. This new package will be built on top of SparkR’s APIs to expose SparkML’s pipeline APIs and functionality.
> *Why not SparkR?*
> The SparkR package contains ~300 functions. Many of these shadow functions in base R and other popular CRAN packages. We think adding more functions to SparkR will degrade usability and make maintenance harder.
> *Why not sparklyr?*
> sparklyr is an R package developed by RStudio Inc. to expose the Spark API to R users. sparklyr includes MLlib API wrappers, but to the best of our knowledge they are not comprehensive. Also, we propose a code-generation approach for this package to minimize the work needed to expose future MLlib APIs, whereas sparklyr's API is written by hand.
> h1. Target Personas
>  * Existing SparkR users who need a more flexible SparkML API
>  * R users (data scientists, statisticians) who wish to build Spark ML pipelines in R
> h1. Goals
>  * R users can install SparkML from CRAN
>  * R users will be able to import SparkML independently of SparkR (see the sketch after this list)
>  * After setting up a Spark session, R users can
>  ** create a pipeline by chaining individual components and specifying their parameters
>  ** tune a pipeline in parallel, taking advantage of Spark
>  ** inspect a pipeline’s parameters and evaluation metrics
>  ** repeatedly apply a pipeline
>  * MLlib contributors can easily add R wrappers for new MLlib Estimators and Transformers
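> To make the first two goals concrete, installation and import could look like the following (a sketch; SparkML is the proposed package name and is not yet published to CRAN):
> {code:java}
> # hypothetical until the package is published to CRAN
> install.packages("SparkML")
> library(SparkML)   # usable without library(SparkR){code}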
> h1. Non-Goals
>  * Adding new algorithms to the SparkML R package that do not exist in Scala
>  * Parallelizing existing CRAN packages
>  * Changing existing SparkR ML wrapping API
> h1. Proposed API Changes
> h2. Design goals
> When encountering trade-offs in the API, we will choose based on the following list of priorities; the option that addresses a higher-priority goal wins.
>  # *Comprehensive coverage of MLlib API:* Design choices that would make R coverage of future ML algorithms difficult will be ruled out.
>  # *Semantic clarity:* We attempt to minimize confusion with other packages. Between conciseness and clarity, we will choose clarity.
>  # *Maintainability and testability:* API choices that require manual maintenance or make testing difficult should be avoided.
>  # *Interoperability with the rest of Spark:* We will keep the R API as thin as possible and keep all functionality implemented in JVM/Scala.
>  # *Being natural to R users:* The ultimate users of this package are R users, and they should find it easy and natural to use.
> The API will follow familiar R function syntax, where the object is passed as the first argument of the method: {{do_something(obj, arg1, arg2)}}. All functions are snake_case (e.g., {{spark_logistic_regression()}} and {{set_max_iter()}}). If a constructor takes arguments, they will be named arguments. For example:
> {code:java}
> > lr <- set_reg_param(set_max_iter(spark_logistic_regression(), 10), 0.1){code}
> When calls need to be chained, as in the example above, the syntax translates naturally to a pipeline style with help from the very popular [magrittr package|https://cran.r-project.org/web/packages/magrittr/index.html]. For example:
> {code:java}
> > spark_logistic_regression() %>% set_max_iter(10) %>% set_reg_param(0.1) -> lr{code}
> h2. Namespace
> All new APIs will live in a new CRAN package named SparkML. The package should be usable without needing SparkR in the namespace. The package will introduce a number of S4 classes that inherit from four basic classes. Here we list the basic types with a few examples. An object of any child class can be instantiated with a function call that starts with {{spark_}}.
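> As an illustration of the naming convention and class hierarchy (a sketch; the concrete inheritance shown is an assumption based on the basic types listed below):
> {code:java}
> > tokenizer <- spark_tokenizer()   # constructor prefixed with spark_
> > is(tokenizer, "Transformer")     # TRUE: concrete S4 classes inherit from a basic class{code}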
> h2. Pipeline & PipelineStage
> A pipeline object contains one or more stages.  
> {code:java}
> > pipeline <- spark_pipeline() %>% set_stages(stage1, stage2, stage3){code}
> where {{stage1}}, {{stage2}}, etc. are S4 objects of type PipelineStage, and {{pipeline}} is an object of type Pipeline.
> h2. Transformers
> A Transformer is an algorithm that can transform one SparkDataFrame into another SparkDataFrame.
> *Example API:*
> {code:java}
> > tokenizer <- spark_tokenizer() %>%
>             set_input_col("text") %>%
>             set_output_col("words")
> > tokenized.df <- tokenizer %>% transform(df) {code}
> h2. Estimators
> An Estimator is an algorithm that can be fit on a SparkDataFrame to produce a Transformer. For example, a learning algorithm is an Estimator that trains on a SparkDataFrame and produces a model.
> *Example API:*
> {code:java}
> lr <- spark_logistic_regression() %>%
>             set_max_iter(10) %>%
>             set_reg_param(0.001){code}
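> Fitting the estimator then produces a Transformer, per the definition above (a sketch; the {{fit()}} generic mirrors MLlib's {{Estimator.fit}}, and {{training}} is an assumed SparkDataFrame):
> {code:java}
> > model <- lr %>% fit(training)   # returns a fitted model, i.e., a Transformer{code}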
> h2. Evaluators
> An Evaluator computes a metric from predictions (model outputs) and returns it as a scalar.
> *Example API:*
> {code:java}
> lr.eval <- spark_regression_evaluator(){code}
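> A hedged usage sketch follows; the {{set_metric_name()}} setter and {{evaluate()}} generic are assumptions mirroring MLlib's Evaluator API, and {{predictions}} is a SparkDataFrame of model outputs:
> {code:java}
> > rmse <- spark_regression_evaluator() %>%
>             set_metric_name("rmse") %>%   # assumed setter, mirroring setMetricName
>             evaluate(predictions)         # returns a scalar metric{code}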
> h2. Miscellaneous Classes
> MLlib pipelines have a variety of miscellaneous classes that serve as helpers and utilities. For example, an object of ParamGridBuilder is used to build a parameter grid for grid search. Another example is ClusteringSummary.
> *Example API:*
> {code:java}
> > grid <- param_grid_builder() %>%
>             add_grid(reg_param(lr), c(0.1, 0.01)) %>%
>             add_grid(fit_intercept(lr), c(TRUE, FALSE)) %>%
>             add_grid(elastic_net_param(lr), c(0.0, 0.5, 1.0))
> > model <- train_validation_split() %>%
>             set_estimator(lr) %>%
>             set_evaluator(spark_regression_evaluator()) %>%
>             set_estimator_param_maps(grid) %>%
>             set_train_ratio(0.8) %>%
>             set_parallelism(2) %>%
>             fit(training){code}
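> The fitted {{model}} can then be applied like any other Transformer (a sketch; {{test}} is an assumed held-out SparkDataFrame):
> {code:java}
> > predictions <- model %>% transform(test){code}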
> h2. Pipeline Persistence
> The SparkML package will fix a longstanding issue with SparkR model persistence (SPARK-15572). SparkML will directly wrap MLlib's pipeline persistence API.
> *API example:*
> {code:java}
> > model <- pipeline %>% fit(training)
> > model %>% spark_write_pipeline(overwrite = TRUE, path = "..."){code}
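> Reading a persisted pipeline back could mirror the write API (the {{spark_read_pipeline()}} name is an assumption, symmetric with {{spark_write_pipeline()}}):
> {code:java}
> > model <- spark_read_pipeline(path = "..."){code}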
> h1. Design Sketch
> We propose using code generation from Scala to produce comprehensive API wrappers in R. For more details, please see the attached design document.
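> As a sketch of what a generated wrapper might look like (an assumption, not actual generator output): each setter could delegate directly to the wrapped JVM object through SparkR's backend, keeping the R layer thin per the design goals above.
> {code:java}
> # hypothetical generated wrapper; assumes a jobj slot holding the Java object
> set_max_iter <- function(obj, value) {
>   # delegate to the JVM setter via SparkR's exported JVM-call API
>   sparkR.callJMethod(obj@jobj, "setMaxIter", as.integer(value))
>   obj   # return the object to allow %>% chaining
> }{code}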


