Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/04/21 03:10:59 UTC

[jira] [Updated] (SPARK-5995) Make ML Prediction Developer APIs public

     [ https://issues.apache.org/jira/browse/SPARK-5995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley updated SPARK-5995:
-------------------------------------
    Description: 
Previously, some Developer APIs were added to spark.ml for classification and regression to make it easier to add new algorithms and models (see [SPARK-4789]).  There are ongoing discussions about the best design of the API.  This JIRA is to continue that discussion and try to finalize those Developer APIs so that they can be made public.

Please see [this design doc from SPARK-4789 | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] for details on the original API design.

Some issues under debate:
* Should there be strongly typed APIs for fit()?
** Proposal: No
* Should the strongly typed API for transform() be public (vs. protected)?
** Proposal: Protected for now
* What transformation methods should the API make developers implement for classification?
** Proposal: See design doc
* Should there be a way to transform a single Row (instead of only DataFrames)?
** Proposal: Not for now

  was:
Previously, some Developer APIs were added to spark.ml for classification and regression to make it easier to add new algorithms and models (see [SPARK-4789]).  There are ongoing discussions about the best design of the API.  This JIRA is to continue that discussion and try to finalize those Developer APIs so that they can be made public.

Please see [this design doc from SPARK-4789 | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] for details on the original API design.

Some issues under debate:
* Should there be strongly typed APIs for fit()?
* Should the strongly typed API for transform() be public (vs. protected)?
* What transformation methods should the API make developers implement for classification?  (See details below.)
* Should there be a way to transform a single Row (instead of only DataFrames)?

More on "What transformation methods should the API make developers implement for classification?":
* Goals:
** Optimize transform: Make it fast, and make it output only the desired columns.
** Easy development
** Support Classifier, Regressor, and ProbabilisticClassifier
* (Current approach) Developers implement a predictX method for each output column X.  They may override transform() to optimize speed.
** Pros: predictX is easy to understand.
** Cons: An optimized transform() is annoying to write.
* Developers implement more basic transformation methods, such as features2raw, raw2pred, raw2prob.
** Pros: Abstract classes may implement optimized transform().
** Cons: Different types of predictors require different methods:
*** Predictor and Regressor: features2pred
*** Classifier: features2raw, raw2pred
*** ProbabilisticClassifier: raw2prob
* Developers implement a single predict() method which takes parameters specifying which columns to output (returning a tuple, or some other type, with None for outputs that were not requested).  Abstract classes take the outputs they want and put them into columns.
** Pros: Developers only write 1 method and can optimize it as much as they want.  It could be more optimized than the previous 2 options; e.g., if LogisticRegressionModel only wants the prediction, then it never has to construct intermediate results such as the vector of raw predictions.
** Cons: predict() will have a different signature for different abstractions, based on the possible output columns.
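
To make the trade-off between the second and third options above more concrete, here is a minimal sketch in plain Python. It is not the spark.ml API: a list of feature vectors stands in for a DataFrame, the class names, the want_* flags, and the output column names (rawPrediction, probability, prediction) are assumptions chosen for illustration, and only the method names features2raw, raw2pred, and raw2prob come from the discussion above.

```python
import math
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Sequence, Tuple


class ProbabilisticClassifierSketch(ABC):
    """Option 2 (hypothetical): developers write small building blocks and
    the abstract class composes them into transform()."""

    @abstractmethod
    def features2raw(self, features: Sequence[float]) -> List[float]: ...

    @abstractmethod
    def raw2prob(self, raw: Sequence[float]) -> List[float]: ...

    def raw2pred(self, raw: Sequence[float]) -> int:
        # Default: index of the largest raw score (first one wins on ties).
        return max(range(len(raw)), key=raw.__getitem__)

    def transform(self, rows, want_raw=False, want_prob=False, want_pred=True):
        # Stand-in for a DataFrame transform: one dict of output columns
        # per row, containing only the columns that were requested.
        out: List[Dict[str, object]] = []
        for features in rows:
            raw = self.features2raw(features)  # always computed in this option
            row: Dict[str, object] = {}
            if want_raw:
                row["rawPrediction"] = raw
            if want_prob:
                row["probability"] = self.raw2prob(raw)
            if want_pred:
                row["prediction"] = self.raw2pred(raw)
            out.append(row)
        return out


class LogisticOption2(ProbabilisticClassifierSketch):
    def __init__(self, weights: Sequence[float], intercept: float):
        self.weights, self.intercept = list(weights), intercept

    def features2raw(self, features):
        m = sum(w * x for w, x in zip(self.weights, features)) + self.intercept
        return [-m, m]

    def raw2prob(self, raw):
        p1 = 1.0 / (1.0 + math.exp(-raw[1]))
        return [1.0 - p1, p1]


class LogisticOption3:
    """Option 3 (hypothetical): a single predict() that is told which
    outputs are wanted and returns None for the rest."""

    def __init__(self, weights: Sequence[float], intercept: float):
        self.weights, self.intercept = list(weights), intercept

    def predict(self, features, want_raw=False, want_prob=False, want_pred=True
                ) -> Tuple[Optional[List[float]], Optional[List[float]], Optional[int]]:
        m = sum(w * x for w, x in zip(self.weights, features)) + self.intercept
        raw = [-m, m] if want_raw else None  # vector is built only on request
        prob = None
        if want_prob:
            p1 = 1.0 / (1.0 + math.exp(-m))
            prob = [1.0 - p1, p1]
        pred = (1 if m > 0 else 0) if want_pred else None
        return raw, prob, pred


rows = [[1.0, 1.0], [0.0, 2.0]]
option2 = LogisticOption2([2.0, -1.0], 0.5)
option3 = LogisticOption3([2.0, -1.0], 0.5)
print(option2.transform(rows, want_prob=True))
print([option3.predict(r) for r in rows])
```

In this sketch, option 2 always builds the raw-prediction vector even when only the prediction column is requested, while option 3 skips it. The cost, as noted in the cons above, is that the return shape of predict() depends on which outputs a given abstraction supports.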



> Make ML Prediction Developer APIs public
> ----------------------------------------
>
>                 Key: SPARK-5995
>                 URL: https://issues.apache.org/jira/browse/SPARK-5995
>             Project: Spark
>          Issue Type: Sub-task
>          Components: ML
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: Joseph K. Bradley
>
> Previously, some Developer APIs were added to spark.ml for classification and regression to make it easier to add new algorithms and models (see [SPARK-4789]).  There are ongoing discussions about the best design of the API.  This JIRA is to continue that discussion and try to finalize those Developer APIs so that they can be made public.
> Please see [this design doc from SPARK-4789 | https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs] for details on the original API design.
> Some issues under debate:
> * Should there be strongly typed APIs for fit()?
> ** Proposal: No
> * Should the strongly typed API for transform() be public (vs. protected)?
> ** Proposal: Protected for now
> * What transformation methods should the API make developers implement for classification?
> ** Proposal: See design doc
> * Should there be a way to transform a single Row (instead of only DataFrames)?
> ** Proposal: Not for now


