Posted to issues@spark.apache.org by "DB Tsai (JIRA)" <ji...@apache.org> on 2015/08/03 01:17:04 UTC

[jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs

    [ https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651269#comment-14651269 ] 

DB Tsai commented on SPARK-6683:
--------------------------------

I think we can close this one. The issues raised here are addressed in https://issues.apache.org/jira/browse/SPARK-8522

> Handling feature scaling properly for GLMs
> ------------------------------------------
>
>                 Key: SPARK-6683
>                 URL: https://issues.apache.org/jira/browse/SPARK-6683
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>            Assignee: DB Tsai
>
> GeneralizedLinearAlgorithm can scale features.  This has two effects:
> * improves optimization behavior (it essentially always helps in practice)
> * changes the optimal solution (often for the better, in terms of standardizing feature importance)
> Current problems:
> * Inefficient implementation: We make a rescaled copy of the data.
> * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries.  (Note: Feature scaling could be handled without changing the solution.)
> * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option.
> This is a proposal discussed with [~mengxr] for an "ideal" solution.  This solution will require some breaking API changes, but I'd argue these are necessary for the long-term since it's the best API we have thought of.
> Proposal:
> * Implementation: Change to avoid making a rescaled copy of the data (described below).  No API issues here.
> * API:
> ** Hide featureScaling from API. (breaking change)
> ** Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
> ** Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step.
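The "improve optimization without changing the solution" point above rests on a coefficient back-transformation. A minimal sketch of that math (plain Python, not Spark's actual Scala implementation): fit least squares on standardized features, then map the coefficients back to the original feature space, and you recover exactly the raw-feature solution.

```python
# Sketch (not Spark's actual code): fitting on standardized features and
# mapping the coefficients back gives the same solution as fitting on raw
# features, so standardization can be purely an optimization aid.
def mean(xs):
    return sum(xs) / len(xs)

def stddev(xs):
    m = mean(xs)
    return (sum((x - m) ** 2 for x in xs) / len(xs)) ** 0.5

def fit(x, y):
    """Closed-form simple linear regression: y ~ w*x + b."""
    mx, my = mean(x), mean(y)
    w = sum((a - mx) * (c - my) for a, c in zip(x, y)) / \
        sum((a - mx) ** 2 for a in x)
    return w, my - w * mx

x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 8.1, 9.9]

w, b = fit(x, y)                      # solution on raw features

mu, sigma = mean(x), stddev(x)
x_std = [(a - mu) / sigma for a in x]
w_std, b_std = fit(x_std, y)          # solution on standardized features
# Undo the scaling: w = w_std / sigma, b = b_std - w_std * mu / sigma
w_back = w_std / sigma
b_back = b_std - w_std * mu / sigma

assert abs(w - w_back) < 1e-9 and abs(b - b_back) < 1e-9
```

The same back-transformation generalizes per-feature to the multivariate case (divide each coefficient by that feature's standard deviation, then adjust the intercept), which is what lets the scaling stay internal to the optimizer.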
> Details on implementation:
> * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above).  This would require storing a vector of length numFeatures, rather than making a full copy of the data.
> * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here.
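The GradientDescent idea above can be sketched as follows (illustrative Python, not Spark's actual GradientDescent code): keep one variance per feature (a vector of length numFeatures) and divide that coordinate's gradient step by it, i.e. a diagonal preconditioner, instead of ever materializing a rescaled copy of the data. The single-feature example below converges to the same solution as the closed-form least-squares fit.

```python
# Hypothetical sketch of per-feature step-size scaling in gradient
# descent: store per-feature variances and scale each coordinate's step,
# rather than making a rescaled copy of the data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 4.0, 6.2, 8.1, 9.9]
n = len(x)

mu = sum(x) / n
var_x = sum((a - mu) ** 2 for a in x) / n   # the only extra stored state

w, b, lr = 0.0, 0.0, 0.05
for _ in range(5000):
    # Gradient of mean squared error (1/n) * sum((w*x + b - y)^2)
    gw = sum(2 * (w * a + b - c) * a for a, c in zip(x, y)) / n
    gb = sum(2 * (w * a + b - c) for a, c in zip(x, y)) / n
    w -= lr * gw / var_x   # per-feature scaled step
    b -= lr * gb

# Closed-form least-squares solution for comparison
my = sum(y) / n
w_star = sum((a - mu) * (c - my) for a, c in zip(x, y)) / \
         sum((a - mu) ** 2 for a in x)
b_star = my - w_star * mu

assert abs(w - w_star) < 1e-3 and abs(b - b_star) < 1e-3
```

Dividing the step by the per-feature variance is equivalent to running plain gradient descent in standardized coordinates, which is why it improves conditioning without changing the optimum; how regularization interacts with this scaling still needs the adjustment mentioned above.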



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
