Posted to issues@spark.apache.org by "Joseph K. Bradley (JIRA)" <ji...@apache.org> on 2015/08/03 02:31:04 UTC

[jira] [Resolved] (SPARK-6683) Handling feature scaling properly for GLMs

     [ https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joseph K. Bradley resolved SPARK-6683.
--------------------------------------
       Resolution: Duplicate
    Fix Version/s: 1.5.0

> Handling feature scaling properly for GLMs
> ------------------------------------------
>
>                 Key: SPARK-6683
>                 URL: https://issues.apache.org/jira/browse/SPARK-6683
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: DB Tsai
>             Fix For: 1.5.0
>
>
> GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
> * it improves optimization behavior (it essentially always helps in practice)
> * changes the optimal solution (often for the better in terms of standardizing feature importance)
> Current problems:
> * Inefficient implementation: We make a rescaled copy of the data.
> * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries.  (Note: Feature scaling could be handled without changing the solution.)
> * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option.
> This is a proposal discussed with [~mengxr] for an "ideal" solution.  This solution will require some breaking API changes, but I'd argue these are necessary for the long-term since it's the best API we have thought of.
> Proposal:
> * Implementation: Change to avoid making a rescaled copy of the data (described below).  No API issues here.
> * API:
> ** Hide featureScaling from API. (breaking change)
> ** Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
> ** Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step.
> Details on implementation:
> * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above).  This would require storing a vector of length numFeatures, rather than making a full copy of the data.
> * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org