
Posted to user@spark.apache.org by VG <vl...@gmail.com> on 2016/07/23 18:37:48 UTC

Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

I am trying to run ml.ALS to compute some recommendations.

Just to test, I am using the same dataset both for training the ALSModel
and for predicting results from the model.

When I evaluate the result using RegressionEvaluator I get
Root-mean-square error = 1.5544064263236066

I think this should be 0. Any suggestions as to what might be going wrong?

Regards,
Vipul

Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by VG <vl...@gmail.com>.

Any suggestions or ideas here?




Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by Nick Pentreath <ni...@gmail.com>.

This is exactly the core problem in the linked issue - normally you would
use the TrainValidationSplit or CrossValidator to do hyper-parameter
selection using cross-validation. You could tune the factor size,
regularization parameter and alpha (for implicit preference data), for
example.

Because of the NaN issue you cannot use the cross-validators currently with
ALS. So you would have to do it yourself manually (dropping the NaNs from
the prediction results as Krishna says).
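
A minimal sketch of what tuning "manually" could look like: loop over
candidate values, fit on the training split, drop NaN predictions, and keep
the value with the lowest validation RMSE. Everything here is an assumption
for illustration. The `fit` function is a toy stand-in (a shrunken per-item
mean), not ALS, and `reg` only plays the role that the regularization
parameter would play in a real run.

```python
import math

def fit(train, reg):
    """Toy stand-in for ALS: per-item mean rating shrunk toward the global mean."""
    global_mean = sum(r for _, _, r in train) / len(train)
    sums, counts = {}, {}
    for _, item, rating in train:
        sums[item] = sums.get(item, 0.0) + rating
        counts[item] = counts.get(item, 0) + 1
    return {item: (sums[item] + reg * global_mean) / (counts[item] + reg)
            for item in sums}

def evaluate(model, rows):
    """RMSE over rows that have a prediction; unseen items (NaN) are dropped."""
    preds = [(r, model.get(item, float("nan"))) for _, item, r in rows]
    kept = [(r, p) for r, p in preds if not math.isnan(p)]
    return math.sqrt(sum((r - p) ** 2 for r, p in kept) / len(kept))

# Hypothetical (user, item, rating) rows.
train = [(1, 10, 5.0), (2, 10, 4.0), (3, 11, 1.0), (1, 11, 2.0), (2, 12, 3.0)]
validation = [(3, 10, 4.0), (4, 11, 2.0), (1, 13, 5.0)]  # item 13 is unseen

# Manual grid search: one fit and one evaluation per candidate value.
results = {reg: evaluate(fit(train, reg), validation) for reg in (0.0, 1.0, 10.0)}
best_reg = min(results, key=results.get)
print(results, best_reg)
```

With Spark the loop would have the same shape, with ALS fitted once per
parameter combination and the NaN rows dropped before calling the evaluator.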




Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by Rohit Chaddha <ro...@gmail.com>.

Hi Krishna,

Great, I had no idea about this. I tried your suggestion of using
na.drop() and got an RMSE of 1.5794048211812495.
Any suggestions on how this can be reduced and the model improved?

Regards,
Rohit


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by Nick Pentreath <ni...@gmail.com>.

Good suggestion Krishna

One issue is that this doesn't work with TrainValidationSplit or
CrossValidator for parameter tuning. Hence my solution in the PR which
makes it work with the cross-validators.


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by Rohit Chaddha <ro...@gmail.com>.

Great, thanks to both of you. I was struggling with this issue as well.

-Rohit



Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by Krishna Sankar <ks...@gmail.com>.

Thanks Nick. I also ran into this issue.
VG, one workaround is to drop the NaNs from the predictions (df.na.drop())
and then pass that dataset to the evaluator. In real life, you would probably
detect the NaN and recommend the most popular items over some window.
HTH.
Cheers
<k/>
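
A small sketch of both ideas in plain Python rather than Spark: filtering out
NaN predictions before computing RMSE (the same idea as df.na.drop()), and a
popularity fallback for ids the model has never seen. The data and the
`most_popular` helper are hypothetical, and the time window is left out for
brevity.

```python
import math
from collections import Counter

def rmse(pairs):
    """Root-mean-square error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

# (rating, prediction) pairs; NaN marks a user or item unseen in training.
scored = [(2.5, 2.0), (2.0, 3.0), (4.0, float("nan"))]

# Keep only the rows that have a usable prediction, then evaluate.
kept = [(r, p) for r, p in scored if not math.isnan(p)]
rmse_kept = rmse(kept)

# Fallback for cold-start users: the most frequently rated training items.
training_items = [10, 11, 10, 12, 10, 11]

def most_popular(n):
    return [item for item, _ in Counter(training_items).most_common(n)]

print(rmse_kept, most_popular(2))
```

Note that the RMSE computed this way only describes the rows that survived
the filter, so it says nothing about how well cold-start users are served.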


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by Nick Pentreath <ni...@gmail.com>.

It seems likely that you're running into
https://issues.apache.org/jira/browse/SPARK-14489 - this occurs when the
test dataset in the train/test split contains users or items that were not
in the training set. Hence the model doesn't have computed factors for
those ids, and ALS 'transform' currently returns NaN for those ids. This in
turn results in NaN for the evaluator result.

I have a PR open on that issue that will hopefully address this soon.
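
To illustrate the mechanism described above, here is a plain-Python sketch
(not Spark code) of a factor lookup that returns NaN for an id without
trained factors, and of how a single NaN turns the whole RMSE into NaN. The
factor values and ids are made up for the example.

```python
import math

# Hypothetical learned factors: only ids seen during training have an entry.
user_factors = {1: [1.0, 0.5], 2: [0.2, 1.1]}
item_factors = {10: [2.0, 1.0], 11: [0.5, 2.0]}

def predict(user, item):
    """Dot product of the factors, or NaN if either id was never trained on."""
    if user not in user_factors or item not in item_factors:
        return float("nan")
    return sum(a * b for a, b in zip(user_factors[user], item_factors[item]))

def rmse(pairs):
    """Root-mean-square error over (rating, prediction) pairs."""
    return math.sqrt(sum((r - p) ** 2 for r, p in pairs) / len(pairs))

# (user, item, rating) rows of a test split; user 3 is absent from training.
test_rows = [(1, 10, 2.5), (2, 11, 2.0), (3, 10, 4.0)]
scored = [(rating, predict(user, item)) for user, item, rating in test_rows]

# One NaN prediction is enough to make the averaged error NaN.
test_rmse = rmse(scored)
print(test_rmse)
```

A random 80/20 split makes this likely whenever some users or items have only
a few ratings, since all of their rows can land in the test portion.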



Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by VG <vl...@gmail.com>.

Ping. Does anyone have suggestions or advice for me?
It would be really helpful.

VG


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by VG <vl...@gmail.com>.

Sean,

I did this just to test the model. When I split my data into 80% training
and 20% test,

I get Root-mean-square error = NaN

So I am wondering where I might be going wrong.

Regards,
VG


Re: Spark ml.ALS question -- RegressionEvaluator .evaluate giving ~1.5 output for same train and predict data

Posted by Sean Owen <so...@cloudera.com>.

No, that's certainly not to be expected. ALS works by computing a much
lower-rank representation of the input. It would not reproduce the
input exactly, and you don't want it to -- this would be seriously
overfit. This is why in general you don't evaluate a model on the
training set.
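
A toy illustration of the point above, assuming a small, fully observed
ratings matrix and a rank-1 factorization fitted by alternating least squares
without regularization. This is a sketch in plain Python, not what Spark's
ALS does internally (which works on sparse observed ratings and applies
regularization). The matrices are made up.

```python
import math

def rank1_als(ratings, iterations=50):
    """Fit ratings ~ outer(u, v) by alternating least squares."""
    n_rows, n_cols = len(ratings), len(ratings[0])
    u = [1.0] * n_rows
    v = [1.0] * n_cols
    for _ in range(iterations):
        # Fix v and solve for every u[i] in closed form.
        vv = sum(x * x for x in v)
        u = [sum(ratings[i][j] * v[j] for j in range(n_cols)) / vv
             for i in range(n_rows)]
        # Fix u and solve for every v[j] in closed form.
        uu = sum(x * x for x in u)
        v = [sum(ratings[i][j] * u[i] for i in range(n_rows)) / uu
             for j in range(n_cols)]
    return u, v

def training_rmse(ratings, u, v):
    """RMSE between the input matrix and its rank-1 reconstruction."""
    errors = [(ratings[i][j] - u[i] * v[j]) ** 2
              for i in range(len(u)) for j in range(len(v))]
    return math.sqrt(sum(errors) / len(errors))

# A matrix that is not rank 1: the reconstruction cannot match it exactly.
ratings = [[5.0, 1.0, 4.0],
           [1.0, 5.0, 2.0],
           [4.0, 2.0, 5.0]]
u, v = rank1_als(ratings)
rmse_full = training_rmse(ratings, u, v)

# A matrix that really is rank 1: here the training error vanishes.
rank1 = [[a * b for b in (2.0, 1.0, 3.0)] for a in (1.0, 2.0, 3.0)]
u1, v1 = rank1_als(rank1)
rmse_rank1 = training_rmse(rank1, u1, v1)

print(rmse_full, rmse_rank1)
```

The training error only reaches zero when the data already has the low rank
the model assumes, which real rating data generally does not.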

