Posted to user@spark.apache.org by Eustache DIEMERT <eu...@diemert.fr> on 2014/07/02 16:11:39 UTC

[mllib] strange/buggy results with RidgeRegressionWithSGD

Hi list,

I'm benchmarking MLlib on a regression task [1] and getting strange results.

Namely, with RidgeRegressionWithSGD the predictions seem to miss the
intercept:

{code}
val trainedModel = RidgeRegressionWithSGD.train(trainingData, 1000)
...
valuesAndPreds.take(10).foreach(println)
{code}

output:

(2007.0,-3.784588726958493E75)
(2003.0,-1.9562390324037716E75)
(2005.0,-4.147413202985629E75)
(2003.0,-1.524938024096847E75)
...

If I change the parameters (step size, regularization and iterations) I get
NaNs more often than not:
(2007.0,NaN)
(2003.0,NaN)
(2005.0,NaN)
...

On the other hand, the DecisionTree model gives sensible results.

I see there is a `setIntercept()` method on the abstract class
GeneralizedLinearAlgorithm that seems to enable fitting the intercept,
but I'm unable to use it from the public interface :(

Any help appreciated :)

Eustache

[1] https://archive.ics.uci.edu/ml/datasets/YearPredictionMSD
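
For reference, the step size, regularization and iteration count mentioned
above can be passed explicitly through the five-argument train overload; a
minimal sketch (the concrete values are illustrative, not a recommendation):

{code}
import org.apache.spark.mllib.regression.RidgeRegressionWithSGD

// train(input, numIterations, stepSize, regParam, miniBatchFraction)
val model = RidgeRegressionWithSGD.train(
  trainingData, // RDD[LabeledPoint], as above
  1000,         // numIterations
  0.01,         // stepSize: too large a step makes SGD diverge to +-Inf/NaN
  0.01,         // regParam: L2 penalty strength
  1.0)          // miniBatchFraction: fraction of the data sampled per step
{code}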

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

Posted by Eustache DIEMERT <eu...@diemert.fr>.
OK, I've tried to add the intercept term myself (code here [1]), but with
no luck.

It seems that adding a column of ones doesn't help with convergence either.

I may have missed something in the coding, as I'm quite a noob in Scala, but
printing the data seems to indicate I succeeded in adding the ones column.

Has anyone here had success with this code on real-world datasets?

[1] https://github.com/oddskool/mllib-samples/tree/ridge (in the ridge
branch)
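
For comparison, a minimal sketch of that ones-column hack, assuming dense
feature vectors and data: RDD[LabeledPoint]:

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

// Append a constant 1.0 feature; its learned weight then acts as the bias.
val withBias = data.map { p =>
  LabeledPoint(p.label, Vectors.dense(p.features.toArray :+ 1.0))
}
{code}

Note that RidgeRegressionWithSGD applies the L2 penalty to every weight, so
a bias smuggled in as a feature gets shrunk toward zero as well, which may
be part of why convergence doesn't improve.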

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

Posted by Eustache DIEMERT <eu...@diemert.fr>.
Well, why not, but IMHO MLlib logistic regression is unusable right now.
The inability to use an intercept is just a no-go. I could hack a column of
ones to inject the intercept into the data, but frankly it's a pity to have
to do so.

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

Posted by DB Tsai <db...@dbtsai.com>.
You may try LBFGS for more stable convergence. In Spark 1.1, we will be
able to use LBFGS instead of GD in the training process.
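
A rough sketch of what that would look like against the 1.1 optimization
package (the parameter values and the zero initial weights are illustrative):

{code}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.optimization.{LBFGS, LeastSquaresGradient, SquaredL2Updater}

// LBFGS.runLBFGS consumes (label, features) pairs rather than LabeledPoint.
val data = trainingData.map(p => (p.label, p.features)).cache()
val numFeatures = data.first()._2.size

val (weights, lossHistory) = LBFGS.runLBFGS(
  data,
  new LeastSquaresGradient(),  // squared loss, as in ridge regression
  new SquaredL2Updater(),      // L2 regularization
  10,                          // numCorrections (history size)
  1e-4,                        // convergenceTol
  100,                         // maxNumIterations
  0.1,                         // regParam
  Vectors.dense(new Array[Double](numFeatures))) // zero initial weights
{code}

Note this doesn't change the intercept situation by itself; one would still
append a bias column (as sketched earlier in the thread) to learn it.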

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

Posted by Eustache DIEMERT <eu...@diemert.fr>.
I tried adjusting stepSize between 1e-4 and 1; it doesn't seem to be the
problem. Actually, the problem is that the model doesn't use the intercept.
So what happens is that it tries to compensate with super heavy weights (>
1e40) and ends up overflowing the model coefficients. MSE explodes too,
as a consequence.
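
One workaround for exactly this failure mode (a sketch, not from the
thread; trainingData and testData are assumed to be RDD[LabeledPoint]) is
to center the target before training and add the mean back at prediction
time, so no huge weights are needed to reach labels around 2000:

{code}
import org.apache.spark.SparkContext._ // Double RDD implicits (mean) on pre-1.3 Spark
import org.apache.spark.mllib.regression.RidgeRegressionWithSGD

// Center the labels (years around 2000) so a zero intercept is roughly right.
val meanLabel = trainingData.map(_.label).mean()
val centered = trainingData.map(p => p.copy(label = p.label - meanLabel))

val model = RidgeRegressionWithSGD.train(centered, 1000)

// Shift predictions back to the original label scale.
val valuesAndPreds = testData.map { p =>
  (p.label, model.predict(p.features) + meanLabel)
}
{code}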

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

Posted by Thomas Robert <th...@creativedata.fr>.
Hi all,

I too am having some issues with *RegressionWithSGD algorithms.

Concerning your issue Eustache, this could be due to the fact that these
regression algorithms use a fixed step (divided by sqrt(iteration)). During
my tests, quite often, the algorithm diverged to an infinite cost, I guessed
because the step was too big. I reduced it and managed to get good results
on a very simple generated dataset.

But I was wondering if anyone here had some advice concerning the use of
these regression algorithms, for example how to choose a good step size and
number of iterations? I wonder if I'm using those right...
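
One illustrative way to pick them is a small grid search over step size and
iteration count, scored by test MSE; a sketch, assuming train and test are
RDD[LabeledPoint]:

{code}
import org.apache.spark.SparkContext._ // Double RDD implicits (mean) on pre-1.3 Spark
import org.apache.spark.mllib.regression.RidgeRegressionWithSGD

val grid = for {
  step  <- Seq(1e-4, 1e-3, 1e-2, 1e-1)
  iters <- Seq(100, 500, 1000)
} yield (step, iters)

val scored = grid.map { case (step, iters) =>
  // train(input, numIterations, stepSize, regParam, miniBatchFraction)
  val model = RidgeRegressionWithSGD.train(train, iters, step, 0.01, 1.0)
  val mse = test.map { p =>
    val err = model.predict(p.features) - p.label
    err * err
  }.mean()
  (step, iters, mse)
}

scored.sortBy(_._3).foreach(println) // smallest test MSE first
{code}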

Thanks,

-- 

*Thomas ROBERT*
www.creativedata.fr

Re: [mllib] strange/buggy results with RidgeRegressionWithSGD

Posted by Eustache DIEMERT <eu...@diemert.fr>.
Printing the model shows the intercept is always 0 :(

Should I open a bug for that?
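
A minimal way to see this, assuming the trainedModel from the first message:

{code}
// Both fields come from GeneralizedLinearModel.
println("intercept = " + trainedModel.intercept) // prints 0.0
println("weights = " + trainedModel.weights)
{code}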
