Posted to issues@spark.apache.org by "Xiangrui Meng (JIRA)" <ji...@apache.org> on 2014/05/18 19:27:38 UTC

[jira] [Commented] (SPARK-1859) Linear, Ridge and Lasso Regressions with SGD yield unexpected results

    [ https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14001136#comment-14001136 ] 

Xiangrui Meng commented on SPARK-1859:
--------------------------------------

The step size must be smaller than the reciprocal of the Lipschitz constant L of the gradient. Your example contains the term 0.5 * (1500 * w - 2400)^2, whose second derivative (Hessian) is 1500 * 1500, so for gradient descent to converge you need a step size smaller than roughly 1.0 / (1500 * 1500). It looks like a simple problem, but it is actually ill-conditioned.

scikit-learn may use a line search or solve the least-squares problem directly, whereas we didn't implement a line search in LinearRegressionWithSGD. You can try L-BFGS in the current master, which should work for your example.
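The step-size bound above can be sketched with plain gradient descent on the single-point loss from the example (a minimal illustration, not MLlib's actual optimizer; the `gradient_descent` helper and the divergence cutoff of 1e15 are my own choices):

```python
# The training point (label 2400, feature 1500) gives the squared loss
#   f(w) = 0.5 * (1500 * w - 2400)^2
# whose second derivative, the Lipschitz constant of the gradient, is
#   L = 1500 * 1500 = 2,250,000.
# Fixed-step gradient descent only converges for step sizes below about 1/L.

def gradient_descent(step, iters=100):
    """Run fixed-step gradient descent on f(w) = 0.5 * (1500*w - 2400)**2."""
    w = 0.0
    for _ in range(iters):
        grad = 1500.0 * (1500.0 * w - 2400.0)  # f'(w)
        w -= step * grad
        if abs(w) > 1e15:  # clearly diverging; stop before overflow
            return float('nan')
    return w

print(gradient_descent(step=1.0))   # MLlib's default step size of 1.0: diverges
print(gradient_descent(step=1e-7))  # below 1/L ~ 4.4e-7: converges to ~1.6
```

With step = 1e-7 the iteration contracts by a factor (1 - 1e-7 * 1500^2) = 0.775 per step and settles at the true slope w = 2400/1500 = 1.6; with the default step of 1.0 the error is multiplied by about -2.25e6 each step, which matches the huge coefficients seen in the report.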

> Linear, Ridge and Lasso Regressions with SGD yield unexpected results
> ---------------------------------------------------------------------
>
>                 Key: SPARK-1859
>                 URL: https://issues.apache.org/jira/browse/SPARK-1859
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib
>    Affects Versions: 0.9.1
>         Environment: OS: Ubuntu Server 12.04 x64
> PySpark
>            Reporter: Vlad Frolov
>              Labels: algorithm, machine_learning, regression
>
> Issue:
> Linear Regression with SGD doesn't work as expected on any data but lpsa.data (the bundled example).
> Ridge Regression with SGD *sometimes* works ok.
> Lasso Regression with SGD *sometimes* works ok.
> Code example (PySpark), based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
> {code:title=regression_example.py}
> from numpy import array
> from pyspark.mllib.regression import LinearRegressionWithSGD
>
> # Each record is [label, feature]; here label = 1.6 * feature
> parsedData = sc.parallelize([
>     array([2400., 1500.]),
>     array([240., 150.]),
>     array([24., 15.]),
>     array([2.4, 1.5]),
>     array([0.24, 0.15])
> ])
> # Build the model
> model = LinearRegressionWithSGD.train(parsedData)
> print model._coeffs
> {code}
> So we have the line {{f(X) = 1.6 * X}} here. (Curiously, {{f(X) = X}} works! :))
> The resulting model has NaN coefficients: {{array([ nan])}}.
> Furthermore, if you comment out the records one by one, you get:
> * a coeff of [-1.55897475e+296] (first record commented out),
> * a coeff of [-8.62115396e+104] (first two records commented out),
> * etc.
> It looks like the implemented regression algorithms diverge.
> I get almost the same results with Ridge and Lasso.
> I've also tested these inputs in scikit-learn, and it works as expected there.
> However, I'm still not sure whether this is a bug or an SGD 'feature'. Should I preprocess my datasets somehow?



--
This message was sent by Atlassian JIRA
(v6.2#6252)