Posted to user@mahout.apache.org by Xiaobo Gu <gu...@gmail.com> on 2011/06/01 11:22:18 UTC

Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?

Hi,

Because ALR splits the training data internally and automatically, I
think we don't have to make a separate validation data set.

Regards,

Xiaobo Gu

Re: Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?

Posted by Lance Norskog <go...@gmail.com>.
Ah! You have a domain problem here, in that your input set may not be
homogeneous over time: data from different time periods might be
distributed differently. You may need to break your time series into
bands and train and test with overlapping bands. That is, train on
January through April, then do two tests: verify with a held-out test
set from that same period, and then test with March through July. This
will show you whether the data changes over time.
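A minimal sketch of that banding, assuming a hypothetical Example class
and band() helper (none of these names come from Mahout; the dates are
just the ones above):

    import java.util.ArrayList;
    import java.util.Date;
    import java.util.List;

    public class TimeBands {
      // Hypothetical record type: one observation stamped with its entry date.
      static class Example {
        Date date;
        // label, features, ...
      }

      // Select the examples falling inside the half-open time band [from, to).
      static List<Example> band(List<Example> all, Date from, Date to) {
        List<Example> out = new ArrayList<Example>();
        for (Example e : all) {
          if (!e.date.before(from) && e.date.before(to)) {
            out.add(e);
          }
        }
        return out;
      }

      // Train on the January-April band, hold out part of it for the
      // in-period test, then score again on the overlapping March-July
      // band. A large gap between the two scores suggests drift over time.
    }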

The AverageAbsoluteDifferenceRecommenderEvaluator does a train & test
across one dataset. It randomly picks, say, 80% for training and then
randomly samples perhaps 30-40% for testing. This is why I suggest the
overlapping bands above: if the data changes over time, the two tests
will differ.
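For the record, the call looks like this (a sketch against Mahout's
Taste API; the file name and the user-based recommender are just
placeholders for whatever you actually build):

    import java.io.File;
    import org.apache.mahout.cf.taste.common.TasteException;
    import org.apache.mahout.cf.taste.eval.RecommenderBuilder;
    import org.apache.mahout.cf.taste.eval.RecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.eval.AverageAbsoluteDifferenceRecommenderEvaluator;
    import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
    import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
    import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.Recommender;
    import org.apache.mahout.cf.taste.similarity.UserSimilarity;

    public class EvalDemo {
      public static void main(String[] args) throws Exception {
        DataModel model = new FileDataModel(new File("ratings.csv")); // placeholder file
        RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();

        RecommenderBuilder builder = new RecommenderBuilder() {
          public Recommender buildRecommender(DataModel training) throws TasteException {
            UserSimilarity sim = new PearsonCorrelationSimilarity(training);
            return new GenericUserBasedRecommender(
                training, new NearestNUserNeighborhood(25, sim, training), sim);
          }
        };

        // 0.8: fraction of each user's preferences used for training;
        // 0.3: fraction of users sampled into the evaluation run.
        double score = evaluator.evaluate(builder, null, model, 0.8, 0.3);
        System.out.println("average absolute difference = " + score);
      }
    }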

Lance

On Wed, Jun 1, 2011 at 8:03 PM, Xiaobo Gu <gu...@gmail.com> wrote:
> On our site we will use Logistic Regression in a batch manner:
> customers who entered in one time frame (such as 2010/1/1 ~ 2010/12/31)
> will be used to train the model, and customers who entered in another
> time frame (such as 2011/1/1 ~ 2011/5/31) will be used to validate the
> model; the model will then be used to predict users who enter after
> 2011/6/1. Does this make sense, or should we feed all data from
> 2010/1/1 to 2011/5/31 to ALR and let it do the hold-out internally?
>
>
>
> On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com> wrote:
>> You don't *have* to have a separate validation set, but it isn't a bad idea.
>>
>> In particular, with large-scale classifiers, production data almost always
>> comes from the future with respect to the training data.  The ALR can't hold
>> out that way because it does on-line training only.  Thus, I would recommend
>> that you still have some kind of evaluation hold-out set segregated by time.
>>
>> Another very serious issue can happen if you have near-duplicates in your
>> data set.  That often happens in news-wire text, for example.  In that case,
>> you would have significant over-fitting with ALR and you wouldn't have a
>> clue without a real time-segregated hold-out set.
>>
>> On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <gu...@gmail.com> wrote:
>>
>>> Hi,
>>>
>>> Because ALR splits the training data internally and automatically, I
>>> think we don't have to make a separate validation data set.
>>>
>>> Regards,
>>>
>>> Xiaobo Gu
>>>
>>
>



-- 
Lance Norskog
goksron@gmail.com

Re: Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?

Posted by Ted Dunning <te...@gmail.com>.
I prefer to make my final held-out set look as much as possible like what
the model will see in production.  So if you plan to retrain every week, I
would train on all available data up to time t and then test on data from
t to t + 1 week.

ALR's internal hold-out set is useful, but things change over time, and
having a held-out sample from the future (relative to the model) is much
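A minimal sketch of that rolling split, reusing the hypothetical Example
class and band() helper from the sketch in the earlier reply (the names
are illustrative, not Mahout API):

    import java.util.Date;
    import java.util.List;

    // Rolling split: everything before the cutoff t trains, and the week
    // after t is the hold-out.
    static void weeklySplit(List<Example> all, Date earliest, Date t) {
      Date tPlusWeek = new Date(t.getTime() + 7L * 24 * 60 * 60 * 1000L);
      List<Example> trainingSet = band(all, earliest, t);  // [earliest, t)
      List<Example> holdOut = band(all, t, tPlusWeek);     // [t, t + 1 week)
      // Fit the model on trainingSet, score it on holdOut, and repeat the
      // whole procedure each week as t advances.
    }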

On Wed, Jun 1, 2011 at 8:03 PM, Xiaobo Gu <gu...@gmail.com> wrote:

> On our site we will use Logistic Regression in a batch manner:
> customers who entered in one time frame (such as 2010/1/1 ~ 2010/12/31)
> will be used to train the model, and customers who entered in another
> time frame (such as 2011/1/1 ~ 2011/5/31) will be used to validate the
> model; the model will then be used to predict users who enter after
> 2011/6/1. Does this make sense, or should we feed all data from
> 2010/1/1 to 2011/5/31 to ALR and let it do the hold-out internally?
>
>
>
> On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com> wrote:
> > You don't *have* to have a separate validation set, but it isn't a bad idea.
> >
> > In particular, with large-scale classifiers, production data almost always
> > comes from the future with respect to the training data.  The ALR can't hold
> > out that way because it does on-line training only.  Thus, I would recommend
> > that you still have some kind of evaluation hold-out set segregated by time.
> >
> > Another very serious issue can happen if you have near-duplicates in your
> > data set.  That often happens in news-wire text, for example.  In that case,
> > you would have significant over-fitting with ALR and you wouldn't have a
> > clue without a real time-segregated hold-out set.
> >
> > On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <gu...@gmail.com> wrote:
> >
> >> Hi,
> >>
> >> Because ALR splits the training data internally and automatically, I
> >> think we don't have to make a separate validation data set.
> >>
> >> Regards,
> >>
> >> Xiaobo Gu
> >>
> >
>

Re: Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?

Posted by Xiaobo Gu <gu...@gmail.com>.
On our site we will use Logistic Regression in a batch manner:
customers who entered in one time frame (such as 2010/1/1 ~ 2010/12/31)
will be used to train the model, and customers who entered in another
time frame (such as 2011/1/1 ~ 2011/5/31) will be used to validate the
model; the model will then be used to predict users who enter after
2011/6/1. Does this make sense, or should we feed all data from
2010/1/1 to 2011/5/31 to ALR and let it do the hold-out internally?



On Wed, Jun 1, 2011 at 10:18 PM, Ted Dunning <te...@gmail.com> wrote:
> You don't *have* to have a separate validation set, but it isn't a bad idea.
>
> In particular, with large-scale classifiers, production data almost always
> comes from the future with respect to the training data.  The ALR can't hold
> out that way because it does on-line training only.  Thus, I would recommend
> that you still have some kind of evaluation hold-out set segregated by time.
>
> Another very serious issue can happen if you have near-duplicates in your
> data set.  That often happens in news-wire text, for example.  In that case,
> you would have significant over-fitting with ALR and you wouldn't have a
> clue without a real time-segregated hold-out set.
>
> On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <gu...@gmail.com> wrote:
>
>> Hi,
>>
>> Because ALR splits the training data internally and automatically, I
>> think we don't have to make a separate validation data set.
>>
>> Regards,
>>
>> Xiaobo Gu
>>
>

Re: Do we have to make a separate hold-out data set for AdaptiveLogisticRegression to measure the performance?

Posted by Ted Dunning <te...@gmail.com>.
You don't *have* to have a separate validation set, but it isn't a bad idea.

In particular, with large-scale classifiers, production data almost always
comes from the future with respect to the training data.  The ALR can't hold
out that way because it does on-line training only.  Thus, I would recommend
that you still have some kind of evaluation hold-out set segregated by time.

Another very serious issue can happen if you have near-duplicates in your
data set.  That often happens in news-wire text, for example.  In that case,
you would have significant over-fitting with ALR and you wouldn't have a
clue without a real time-segregated hold-out set.
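For reference, the internal hold-out score ALR computes is easy to read
off after training; a minimal sketch against Mahout's SGD classes (the
feature count and the MyRecord/records input loop are placeholders, not
part of the API):

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.CrossFoldLearner;
    import org.apache.mahout.classifier.sgd.L1;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    int numFeatures = 1000;  // placeholder width for the feature vectors
    AdaptiveLogisticRegression learner =
        new AdaptiveLogisticRegression(2, numFeatures, new L1());

    for (MyRecord r : records) {                 // hypothetical input records
      Vector v = new RandomAccessSparseVector(numFeatures);
      // ... encode r's features into v ...
      learner.train(r.label, v);                 // label is 0 or 1
    }
    learner.close();

    // The internal cross-fold evaluation gives an AUC estimate "for free",
    // but its held-out records come from the same time period as the
    // training data, which is exactly the limitation discussed above.
    CrossFoldLearner best = learner.getBest().getPayload().getLearner();
    System.out.println("internal AUC = " + best.auc());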

On Wed, Jun 1, 2011 at 2:22 AM, Xiaobo Gu <gu...@gmail.com> wrote:

> Hi,
>
> Because ADR split the training data internally automatically,so I
> think we don't have to make a separate validation data set.
>
> Regards,
>
> Xiaobo Gu
>