Posted to issues@spark.apache.org by "Vladimir Feinberg (JIRA)" <ji...@apache.org> on 2016/07/11 19:02:11 UTC

[jira] [Commented] (SPARK-10931) PySpark ML Models should contain Param values

    [ https://issues.apache.org/jira/browse/SPARK-10931?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15371420#comment-15371420 ] 

Vladimir Feinberg commented on SPARK-10931:
-------------------------------------------

[~josephkb] The intention of this JIRA is a bit confusing. To my understanding, there are three kinds of params:

1. Estimator-related params that only have to do with fitting (e.g., regularization).
2. Independent model and estimator-related params that have to do with prediction (e.g., maximum number of iterations).
3. Shared model and estimator params that are set once per fitted pipeline (e.g., number of components in PCA).

I'd venture that we'd want a model to have:

1. Access to an immutable version of (1) and (3).
  * In Scala, this is done through a {{parent}} reference to the generating {{Estimator}}. Because it is only a reference, any later change to the estimator also changes the params seen through the model, leaving them inconsistent with what the model was actually fitted with. It should be copy-on-write (this may be SPARK-7494, I'm not sure). Also, {{parent}} is itself a mutable reference.
  * In Python, there is no {{parent}}.

2. Access to a mutable version of (2), where mutation should change model behavior.
  * Both languages have this.

3. Separation of concerns. If a parameter falls into categories (1) or (3), it shouldn't be a parameter for the model, since changing its value has no effect other than causing confusion.
  * Both Python and Scala will, as of this JIRA, copy everything - groups (1), (2), (3) - to the model, each with its own version.
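
To make the reference-versus-snapshot point concrete, here is a minimal sketch in plain Python. It does not use the Spark API: {{ToyEstimator}}, {{ReferencingModel}}, {{SnapshotModel}}, and the param names {{reg_param}}, {{k}}, and {{threshold}} are all hypothetical, and {{threshold}} is only my assumed stand-in for a group (2) param. It contrasts a model that keeps a live reference to its estimator with one that holds a read-only snapshot of groups (1) and (3) plus a mutable group (2) value.

```python
from types import MappingProxyType


class ReferencingModel:
    """Toy model that keeps a live reference to its estimator (hypothetical)."""

    def __init__(self, parent):
        self.parent = parent


class SnapshotModel:
    """Toy model with a read-only snapshot of fit-time params (hypothetical)."""

    def __init__(self, fit_params, threshold):
        self.fit_params = fit_params  # groups (1) and (3): read-only
        self.threshold = threshold    # group (2): mutable, affects predict()

    def predict(self, score):
        return 1 if score >= self.threshold else 0


class ToyEstimator:
    """Toy stand-in for an Estimator; not the Spark API."""

    def __init__(self, reg_param=0.1, k=2, threshold=0.5):
        self.reg_param = reg_param  # group (1): only matters while fitting
        self.threshold = threshold  # group (2): matters at prediction time
        self.k = k                  # group (3): fixed once the model is fitted

    def fit_with_reference(self):
        # The model sees whatever the estimator holds *now*, not at fit time.
        return ReferencingModel(parent=self)

    def fit_with_snapshot(self):
        # Copy the fit-time values into a fresh dict and expose it read-only.
        frozen = MappingProxyType({"reg_param": self.reg_param, "k": self.k})
        return SnapshotModel(frozen, threshold=self.threshold)


est = ToyEstimator(reg_param=0.1, k=2, threshold=0.5)
ref_model = est.fit_with_reference()
snap_model = est.fit_with_snapshot()

# Mutating the estimator after fitting leaks into the referencing model only.
est.reg_param = 10.0
print(ref_model.parent.reg_param)          # 10.0
print(snap_model.fit_params["reg_param"])  # 0.1

# Mutating a group (2) param on the model changes its behavior.
snap_model.threshold = 0.9
print(snap_model.predict(0.7))             # 0
```

In this sketch, trying to assign {{snap_model.fit_params["k"] = 3}} raises a {{TypeError}}, which is the immutability I'd want for groups (1) and (3). A real implementation would still need to decide how params are copied and exposed on the JVM side; this only illustrates the intended semantics.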

> PySpark ML Models should contain Param values
> ---------------------------------------------
>
>                 Key: SPARK-10931
>                 URL: https://issues.apache.org/jira/browse/SPARK-10931
>             Project: Spark
>          Issue Type: Improvement
>          Components: ML, PySpark
>            Reporter: Joseph K. Bradley
>
> PySpark spark.ml Models are generally wrappers around Java objects and do not even contain Param values.  This JIRA is for copying the Param values from the Estimator to the model.
> This can likely be solved by modifying Estimator.fit to copy Param values, but should also include proper unit tests.


