Posted to issues@spark.apache.org by "Albert Azout (JIRA)" <ji...@apache.org> on 2015/06/26 06:46:04 UTC
[jira] [Commented] (SPARK-1859) Linear, Ridge and Lasso Regressions
with SGD yield unexpected results
[ https://issues.apache.org/jira/browse/SPARK-1859?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14602388#comment-14602388 ]
Albert Azout commented on SPARK-1859:
-------------------------------------
Hi, this is still an open issue for us, FYI. Has there been any progress toward a resolution?
> Linear, Ridge and Lasso Regressions with SGD yield unexpected results
> ---------------------------------------------------------------------
>
> Key: SPARK-1859
> URL: https://issues.apache.org/jira/browse/SPARK-1859
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 0.9.1
> Environment: OS: Ubuntu Server 12.04 x64
> PySpark
> Reporter: Vlad Frolov
> Labels: algorithm, machine_learning, regression
>
> Issue:
> Linear Regression with SGD doesn't work as expected on any data except lpsa.dat (the example dataset).
> Ridge Regression with SGD *sometimes* works ok.
> Lasso Regression with SGD *sometimes* works ok.
> Code example (PySpark) based on http://spark.apache.org/docs/0.9.0/mllib-guide.html#linear-regression-2 :
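> {code:title=regression_example.py}
> from numpy import array
> from pyspark.mllib.regression import LinearRegressionWithSGD
> # sc is the SparkContext provided by the pyspark shell.
> # Each record is [label, feature].
> parsedData = sc.parallelize([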
> array([2400., 1500.]),
> array([240., 150.]),
> array([24., 15.]),
> array([2.4, 1.5]),
> array([0.24, 0.15])
> ])
> # Build the model
> model = LinearRegressionWithSGD.train(parsedData)
> print model._coeffs
> {code}
> So the data lie exactly on a line ({{f(X) = 1.6 * X}}). Fortunately, data on the line {{f(X) = X}} works! :)
> The resulting model has nan coeffs: {{array([ nan])}}.
> Furthermore, if you comment out the records one at a time, you get:
> * [-1.55897475e+296] coeff (the first record is commented out),
> * [-8.62115396e+104] coeff (the first two records are commented out),
> * etc
> It looks like the implemented regression algorithms diverge somehow.
> I get almost the same results on Ridge and Lasso.
> I've also tested these inputs in scikit-learn and it works as expected there.
> However, I'm still not sure whether it's a bug or an SGD 'feature'. Should I preprocess my datasets somehow?
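On the preprocessing question, here is a minimal sketch in plain Python 3 (no Spark) of how gradient descent can diverge on unscaled features like the ones in the report. It is not Spark's implementation: the full-batch update, the constant step size of 1.0, the 100 iterations, and the names fit_slope and scale are all assumptions chosen for illustration. Dividing the feature by its largest absolute value is one hypothetical preprocessing step, not necessarily the recommended one.

```python
import math

# Records from the report, as (label, feature) pairs; label = 1.6 * feature.
data = [(2400.0, 1500.0), (240.0, 150.0), (24.0, 15.0), (2.4, 1.5), (0.24, 0.15)]


def fit_slope(points, step=1.0, iterations=100):
    """Full-batch gradient descent on mean squared error for y = w * x (no intercept)."""
    w = 0.0
    n = len(points)
    for _ in range(iterations):
        # d/dw of mean((w*x - y)^2) is 2 * mean((w*x - y) * x)
        grad = 2.0 * sum((w * x - y) * x for y, x in points) / n
        w -= step * grad
    return w


# Unscaled feature: mean(x^2) is about 4.5e5, so each update multiplies the
# error by a factor far larger than 1 in magnitude and w overflows.
w_raw = fit_slope(data)

# Hypothetical preprocessing: divide the feature by its largest absolute value.
scale = max(abs(x) for _, x in data)
scaled = [(y, x / scale) for y, x in data]
w_scaled = fit_slope(scaled)
w_rescaled = w_scaled / scale  # map the weight back to the original feature units

print("unscaled:", w_raw)
print("scaled:  ", w_rescaled)
```

In this sketch the unscaled run ends in a non-finite weight, while the scaled run recovers a slope close to 1.6. A much smaller step size on the unscaled data should have a similar stabilizing effect, at the cost of needing it tuned per dataset.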