Posted to issues@spark.apache.org by "DB Tsai (JIRA)" <ji...@apache.org> on 2015/04/03 23:13:53 UTC

[jira] [Commented] (SPARK-6683) Handling feature scaling properly for GLMs

    [ https://issues.apache.org/jira/browse/SPARK-6683?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14395088#comment-14395088 ] 

DB Tsai commented on SPARK-6683:
--------------------------------

I have this implemented in our lab, including handling the intercept
without appending a bias column to the training dataset, which
improves performance significantly without requiring extra caching.
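
Roughly, as a toy sketch (made-up names, not the actual patch): the
intercept stays a separate scalar, so no constant-1 column is ever
appended to the training vectors.

  // Margin and intercept gradient for binary log-loss, with the intercept
  // kept as a separate scalar instead of a bias column in the data.
  def marginAndInterceptGrad(x: Array[Double], y: Double,
      w: Array[Double], intercept: Double): (Double, Double) = {
    var margin = intercept
    var j = 0
    while (j < x.length) { margin += w(j) * x(j); j += 1 }
    val p = 1.0 / (1.0 + math.exp(-margin))
    // gradient of -log P(y | x) w.r.t. the intercept is just (p - y)
    (margin, p - y)
  }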

In logistic regression, the objective function is a sum of log
probabilities, which is invariant under this transformation. This
implies that instead of rescaling x, we can get the same result by
rescaling the gradient, so the scaling can be applied right before
optimization.
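
A toy sketch of what I mean by rescaling the gradient (illustrative
names only; sigma(j) is the standard deviation of feature j, and the
coefficients live in the scaled space):

  // Per-example log-loss gradient w.r.t. the scaled-space coefficients,
  // computed on the original data: only a 1 / sigma(j) factor changes,
  // so no rescaled copy of the dataset is materialized.
  def scaledSpaceGradient(x: Array[Double], y: Double,
      wScaled: Array[Double], sigma: Array[Double]): Array[Double] = {
    var margin = 0.0
    var j = 0
    while (j < x.length) { margin += wScaled(j) * x(j) / sigma(j); j += 1 }
    val p = 1.0 / (1.0 + math.exp(-margin))
    Array.tabulate(x.length)(j => (p - y) * x(j) / sigma(j))
  }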

However, in linear regression, the objective value itself changes
under the transformation, so I need to handle it differently.

As a result, it will be challenging to come up with one framework
that works for all the different types of generalized linear models.

I would like to have them implemented separately in the new Spark ML
codebase instead of sharing the same GLM base class. What do you
think?


> Handling feature scaling properly for GLMs
> ------------------------------------------
>
>                 Key: SPARK-6683
>                 URL: https://issues.apache.org/jira/browse/SPARK-6683
>             Project: Spark
>          Issue Type: Improvement
>          Components: MLlib
>    Affects Versions: 1.3.0
>            Reporter: Joseph K. Bradley
>
> GeneralizedLinearAlgorithm can scale features.  This has 2 effects:
> * improves optimization behavior (essentially always improves behavior in practice)
> * changes the optimal solution (often for the better in terms of standardizing feature importance)
> Current problems:
> * Inefficient implementation: We make a rescaled copy of the data.
> * Surprising API: For algorithms which use feature scaling, users may get different solutions than with R or other libraries.  (Note: Feature scaling could be handled without changing the solution.)
> * Inconsistent API: Not all algorithms have the same default for feature scaling, and not all expose the option.
> This is a proposal discussed with [~mengxr] for an "ideal" solution.  This solution will require some breaking API changes, but I'd argue these are necessary for the long-term since it's the best API we have thought of.
> Proposal:
> * Implementation: Change to avoid making a rescaled copy of the data (described below).  No API issues here.
> * API:
> ** Hide featureScaling from API. (breaking change)
> ** Internally, handle feature scaling to improve optimization, but modify it so it does not change the optimal solution. (breaking change, in terms of algorithm behavior)
> ** Externally, users who want to rescale features (to change the solution) should do that scaling as a preprocessing step.
> Details on implementation:
> * GradientDescent could instead scale the step size separately for each feature (and adjust regularization as needed; see the PR linked above).  This would require storing a vector of length numFeatures, rather than making a full copy of the data.  (A sketch of this equivalence appears below this description.)
> * I haven't thought this through for LBFGS, but I hope [~dbtsai] can weigh in here.
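
A sketch of the per-feature step-size idea in the bullet above
(illustrative names only; regularization omitted): running gradient
descent on features divided by sigma(j), expressed back in the
original coefficient space, is the same as using step size
alpha / sigma(j)^2 for feature j on the unscaled data, so only a
vector of length numFeatures needs to be stored.

  // One gradient-descent step with a per-feature effective step size,
  // equivalent to a step of size alpha on data rescaled by 1 / sigma(j).
  def perFeatureStepUpdate(w: Array[Double], grad: Array[Double],
      sigma: Array[Double], alpha: Double): Array[Double] =
    Array.tabulate(w.length)(j => w(j) - (alpha / (sigma(j) * sigma(j))) * grad(j))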



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org