Posted to user@mahout.apache.org by Sanjib Kumar Das <sa...@gmail.com> on 2010/10/29 07:06:55 UTC

Why can't I train using the entire dataset during RMSE evaluation?

I want to train my recommender on the entire dataset while evaluating its
RMSE.
It gives a NaN when I set trainingPercentage = 1.
I know I can set it to 0.99 and get my work done, but logically there is
nothing wrong with setting it to 1.0.
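
For reference, here is a minimal sketch of the evaluation call in question,
assuming the Mahout 0.4-era Taste API and the SVDRecommender settings
mentioned later in this thread (numFeatures = 30, initialSteps = 50); the
ratings path is illustrative:

import java.io.File;

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.eval.RMSRecommenderEvaluator;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.recommender.svd.SVDRecommender;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.recommender.Recommender;

public class EvaluateAllData {
  public static void main(String[] args) throws Exception {
    DataModel model = new FileDataModel(new File("/tmp/ratings.txt"));
    RecommenderEvaluator evaluator = new RMSRecommenderEvaluator();
    RecommenderBuilder builder = new RecommenderBuilder() {
      @Override
      public Recommender buildRecommender(DataModel dataModel) throws TasteException {
        return new SVDRecommender(dataModel, 30, 50); // numFeatures, initialSteps
      }
    };
    // trainingPercentage = 1.0 leaves the test split empty, so this prints NaN;
    // 0.99 works because 1% of each user's preferences is held out for testing.
    double rmse = evaluator.evaluate(builder, null, model, 1.0, 1.0);
    System.out.println(rmse);
  }
}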

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Ted Dunning <te...@gmail.com>.
It isn't so much that it doesn't make sense as that it just gives you
garbage for answers.

On Thu, Oct 28, 2010 at 10:32 PM, Sanjib Kumar Das <sa...@gmail.com> wrote:

> Why do you say that it "does not make sense"?
>

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Sanjib Kumar Das <sa...@gmail.com>.
Yeah, Lance, I found your Dual versions very useful. Thanks a lot!

On Sat, Oct 30, 2010 at 10:54 PM, Lance Norskog <go...@gmail.com> wrote:

> Hi-
>
> Did the Dual versions work for you? If so I'll clean them up and post them.
>
> Lance
>
> [remainder of quoted thread snipped]

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Lance Norskog <go...@gmail.com>.
Hi-

Did the Dual versions work for you? If so I'll clean them up and post them.

Lance

On Fri, Oct 29, 2010 at 12:40 PM, Sanjib Kumar Das <sa...@gmail.com> wrote:
> No, it won't give an RMSE of 0.
>
> I ran an SVDRecommender (numFeatures = 30, initialSteps = 50) on the 1M
> MovieLens dataset and got the output below.
> AbstractDifferenceRecommenderEvaluatorDual just has an added method
> evaluateDual(RecommenderBuilder rb, DataModel trainingModel, DataModel
> testingModel),
> and I specified the same data model for both training and testing.
>
> And thanks, Lance, for the 'Dual' versions of the evaluators.
>
> [log output and earlier quoted messages snipped; the full log appears in
> the original message below]



-- 
Lance Norskog
goksron@gmail.com

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Sean Owen <sr...@gmail.com>.
You're right; this implementation is the exception. It does not check
whether it already "knows the answer" and return the known preference. I'd
regard that as a small deficiency.
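
For illustration, the missing guard might look like this inside an
estimatePreference() implementation; this is a sketch of the idea, not
actual Mahout source, and estimateFromFactorization is a hypothetical
helper:

// Illustrative only: short-circuit estimation when the rating is already known.
@Override
public float estimatePreference(long userID, long itemID) throws TasteException {
  Float actualPref = getDataModel().getPreferenceValue(userID, itemID);
  if (actualPref != null) {
    return actualPref; // the rating is in the training data; return it verbatim
  }
  return estimateFromFactorization(userID, itemID); // hypothetical helper
}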

On Fri, Oct 29, 2010 at 8:40 PM, Sanjib Kumar Das <sa...@gmail.com> wrote:

> No, it won't give an RMSE of 0.
>
>

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Sanjib Kumar Das <sa...@gmail.com>.
No, it won't give an RMSE of 0.

I ran an SVDRecommender (numFeatures = 30, initialSteps = 50) on the 1M
MovieLens dataset and got the output below.
AbstractDifferenceRecommenderEvaluatorDual just has an added method
evaluateDual(RecommenderBuilder rb, DataModel trainingModel, DataModel
testingModel),
and I specified the same data model for both training and testing.

And thanks, Lance, for the 'Dual' versions of the evaluators.
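
A usage sketch of the call just described, reusing the model and builder
from the sketch near the top of this thread; only the evaluateDual signature
comes from Lance's patch, and the concrete evaluator class named here is
hypothetical:

// Hypothetical class name; only evaluateDual's signature is from the thread.
AbstractDifferenceRecommenderEvaluatorDual evaluator =
    new RMSRecommenderEvaluatorDual();
// Same data model for training and testing, as described above.
double rmse = evaluator.evaluateDual(builder, model, model);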

10/10/29 13:45:49 INFO file.FileDataModel: Creating FileDataModel for file
/tmp/ratings.txt
10/10/29 13:45:49 INFO file.FileDataModel: Reading file info...
10/10/29 13:45:52 INFO file.FileDataModel: Processed 1000000 lines
10/10/29 13:45:52 INFO file.FileDataModel: Read lines: 1000209
10/10/29 13:45:54 INFO model.GenericDataModel: Processed 6040 users
10/10/29 13:45:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Beginning evaluation using training and test models
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Beginning evaluation of 6040 users
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Starting timing of 6040 tasks in 2 threads
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Average time per recommendation: 0ms
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Approximate memory used: 111MB / 314MB
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Average time per recommendation: 0ms
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Approximate memory used: 116MB / 314MB
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Average time per recommendation: 0ms
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Approximate memory used: 116MB / 314MB
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Average time per recommendation: 0ms
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Approximate memory used: 116MB / 314MB
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Average time per recommendation: 0ms
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Approximate memory used: 126MB / 314MB
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Average time per recommendation: 0ms
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Approximate memory used: 126MB / 314MB
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Average time per recommendation: 0ms
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Approximate memory used: 126MB / 314MB
10/10/29 13:56:54 INFO eval.AbstractDifferenceRecommenderEvaluatorDual:
Evaluation result: 0.7368197239221315
10/10/29 13:56:54 INFO bucky.BuckyRecommenderEvaluatorRunner:
0.7368197239221315

On Fri, Oct 29, 2010 at 2:28 AM, Sean Owen <sr...@gmail.com> wrote:

> [earlier quoted messages snipped]

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Sean Owen <sr...@gmail.com>.
It's true that the recommenders will give you a score of 0 when using 100%
of the input for training, for the reasons given. That should be the case:
it doesn't need to estimate any answers; it knows them already.

But yes, I see your question now. No, there is not a direct way to do it,
but I think you'll find it easy to hack the code that's there now. Just
replace the step that splits the training/test data with one that keeps all
data as training and loads something else as test, as in the sketch below.
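
A minimal sketch of what such a dual-model evaluation could look like, built
only from public Taste interfaces. The method name evaluateDual comes from
Lance's patch discussed in this thread; the body below is an assumption, not
his code:

import org.apache.mahout.cf.taste.common.TasteException;
import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
import org.apache.mahout.cf.taste.impl.common.LongPrimitiveIterator;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.model.Preference;
import org.apache.mahout.cf.taste.recommender.Recommender;

public double evaluateDual(RecommenderBuilder rb,
                           DataModel trainingModel,
                           DataModel testingModel) throws TasteException {
  // Train only on the training model.
  Recommender recommender = rb.buildRecommender(trainingModel);
  double sumSquaredError = 0.0;
  int count = 0;
  // Score every preference in the testing model against the trained recommender.
  LongPrimitiveIterator userIDs = testingModel.getUserIDs();
  while (userIDs.hasNext()) {
    long userID = userIDs.nextLong();
    for (Preference pref : testingModel.getPreferencesFromUser(userID)) {
      float estimate = recommender.estimatePreference(userID, pref.getItemID());
      if (!Float.isNaN(estimate)) {
        double diff = estimate - pref.getValue();
        sumSquaredError += diff * diff;
        count++;
      }
    }
  }
  // An empty testing model yields NaN here, matching the trainingPercentage = 1.0
  // symptom in the stock evaluator.
  return count == 0 ? Double.NaN : Math.sqrt(sumSquaredError / count);
}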

On Fri, Oct 29, 2010 at 6:56 AM, Sanjib Kumar Das <sa...@gmail.com> wrote:

> Some weird kind of miscommunication has taken place. Just to be on the same
> page:
>
> I know it makes sense to keep the 'unseen' testing data different from the
> 'seen' training data. But I wanted to evaluate my recommender for all
> possible scenarios, and one such scenario was keeping the testing and
> training data the same.
>
> I'll rephrase my question again:
> Can I specify one file (data set) to be used for training and another
> file (data set) to be used for testing, instead of specifying percentages?
>
>

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Sanjib Kumar Das <sa...@gmail.com>.
Some weird kind of miscommunication has taken place. Just to be on the same
page:

I know it makes sense to keep the 'unseen' testing data different from the
'seen' training data. But I wanted to evaluate my recommender for all
possible scenarios, and one such scenario was keeping the testing and
training data the same.

I'll rephrase my question again:
Can I specify one file (data set) to be used for training and another
file (data set) to be used for testing, instead of specifying percentages?


On Fri, Oct 29, 2010 at 12:40 AM, Gabriel Webster
<ga...@htc.com> wrote:

> [earlier quoted messages snipped]

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Gabriel Webster <ga...@htc.com>.
Read the wiki page; you might also want to read up on machine learning 
more generally.  The example that Tommy gave is the extreme, straw man 
example in which the training algorithm simply memorizes the training 
data.  Most real training algorithms don't actually return 100% accuracy 
on the training data, because they effectively compress the training 
data into a model, and because this compression is lossy, it can't 
memorize the data exactly.  But what you hope is that the compression 
throws out the information that is specific to the training data (and is 
thus useless for predicting test data points), and keeps the information 
that describes the general behaviour of the data (which will help 
predict the test data).  These ideas are formalized in, for example, 
Minimum Description Length, so you might want to read up on that as 
well.  But the upshot is that most real algorithms perform significantly 
better on seen training data than on unseen test data, so testing on 
training data gives you incorrectly high accuracy (incorrect because in 
the real world, you will be running your algorithm on unseen data).

On 10/29/10 1:32 PM, Sanjib Kumar Das wrote:
> Why do you say that it "does not make sense"?
> Do you mean that if I train and test on the entire dataset, I should
> get an RMSE of 0 trivially? That is not true.
> Consider the SVD recommender:
> M != LR (where M is the original matrix and L, R are the matrices obtained
> by factorization).
>
> Okay, let me rephrase my doubt this way:
> Is it possible to specify one dataset for training and another dataset for
> testing while evaluating the recommender?
>
> [earlier quoted messages snipped]

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Sanjib Kumar Das <sa...@gmail.com>.
Why do you say that it "does not make sense"?
Do you mean that if I train and test on the entire dataset, I should get an
RMSE of 0 trivially? That is not true.
Consider the SVD recommender:
M != LR (where M is the original matrix and L, R are the matrices obtained
by factorization). A low-rank factorization only approximates M, so even on
the training ratings the reconstruction error is nonzero.

Okay, let me rephrase my doubt this way:
Is it possible to specify one dataset for training and another dataset for
testing while evaluating the recommender?

On Fri, Oct 29, 2010 at 12:20 AM, Tommy Chheng <to...@gmail.com> wrote:

> [earlier quoted messages snipped]

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Tommy Chheng <to...@gmail.com>.
Training and testing with the entire set does not make sense. Read about
overfitting for more details: http://en.wikipedia.org/wiki/Overfitting

"As a simple example, consider a database of retail purchases that 
includes the item bought, the purchaser, and the date and time of 
purchase. It's easy to construct a model that will fit the training set 
perfectly by using the date and time of purchase to predict the other 
attributes; *but this model will not generalize at all to new data, 
because those past times will never occur again.*"

@tommychheng

On 10/28/10 10:11 PM, Sanjib Kumar Das wrote:
> Suppose I want to train with the entire data set and test it with the entire
> data set; how should I go about it?
>
> [earlier quoted messages snipped]

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Sanjib Kumar Das <sa...@gmail.com>.
Suppose I want to train with the entire data set and test it with the entire
data set; how should I go about it?

On Fri, Oct 29, 2010 at 12:09 AM, Gabriel Webster
<ga...@htc.com> wrote:

> Logically there is something wrong with setting the training percentage to
> 1.0, because that means the testing percentage is 0.0!  If you don't test on
> any items then you can't get an RMSE.
>
>
> [quoted original post snipped]

Re: Why can't I train using the entire dataset during RMSE evaluation?

Posted by Gabriel Webster <ga...@htc.com>.
Logically there is something wrong with setting the training percentage 
to 1.0, because that means the testing percentage is 0.0!  If you don't 
test on any items then you can't get an RMSE.
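
Concretely, the NaN arises because the evaluator then has nothing to
average: in Java, dividing a 0.0 error total by a count of 0 yields NaN. A
minimal illustration (not the evaluator's actual source):

public class NanDemo {
  public static void main(String[] args) {
    double sumOfSquaredDiffs = 0.0; // nothing held out, so no error accumulated
    int count = 0;                  // and nothing counted
    // 0.0 / 0 promotes to 0.0 / 0.0, which is IEEE NaN; sqrt(NaN) is also NaN.
    System.out.println(Math.sqrt(sumOfSquaredDiffs / count)); // prints NaN
  }
}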

On 10/29/10 1:06 PM, Sanjib Kumar Das wrote:
> I want to train my recommender on the entire dataset while evaluating its
> RMSE.
> It gives a NaN when I set trainingPercentage = 1.
> I know I can set it to 0.99 and get my work done, but logically there is
> nothing wrong with setting it to 1.0.