Posted to user@mahout.apache.org by Ted Dunning <te...@gmail.com> on 2014/03/01 05:57:12 UTC

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

I have been swamped.  Generally, AdaGrad is a great idea.  The code looks fine at first glance.  Certainly some sort of AdaGrad would be preferable to the hack that I put in.

Sent from my iPhone

> On Feb 26, 2014, at 18:30, Vishal Santoshi <vi...@gmail.com> wrote:
> 
> Ted,  Any feedback ?
> 
> 
> On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
> <vi...@gmail.com>wrote:
> 
>> Hello Ted,
>> 
>>                  This is regarding the AdaGrad update per feature. I have
>> attached a file which reflects
>> http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf (2)
>> 
>> 
>> 
>> It does differ from OnlineLogisticRegression in the way it implements
>> 
>> public double perTermLearningRate(int j);
>> 
>> 
>> This class maintains two DenseVectors:
>> 
>> /** ADA per-term sum of squares of learning gradients */
>> protected Vector perTermLSumOfSquaresOfGradients;
>> 
>> /** ADA per-term learning gradient */
>> protected Vector perTermGradients;
>> 
>> and it overrides the learn(...) method to update these two vectors
>> accordingly.
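
A minimal, self-contained sketch of the per-term AdaGrad bookkeeping described above (the class and field names are illustrative assumptions, not the contents of the attached file, and plain arrays stand in for Mahout's Vector):

public class PerTermAdaGrad {
  // One squared-gradient accumulator per feature, as in AdaGrad.
  private final double[] sumOfSquaredGradients;
  private final double initialRate;
  private static final double EPSILON = 1.0e-10;  // guards the first division

  public PerTermAdaGrad(int numFeatures, double initialRate) {
    this.sumOfSquaredGradients = new double[numFeatures];
    this.initialRate = initialRate;
  }

  /** Called from learn(...) with the gradient just computed for feature j. */
  public void accumulate(int j, double gradient) {
    sumOfSquaredGradients[j] += gradient * gradient;
  }

  /** Per-term rate: eta_j = initialRate / sqrt(sum of squared gradients for j). */
  public double perTermLearningRate(int j) {
    return initialRate / Math.sqrt(sumOfSquaredGradients[j] + EPSILON);
  }
}

A subclass of OnlineLogisticRegression could delegate its perTermLearningRate(int j) to an object like this and call accumulate(...) from its learn(...) override.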
>> 
>> 
>> 
>> 
>> Please tell me if I am totally off here.
>> 
>> 
>> 
>> Thank you for your help and Regards.
>> 
>> 
>> Vishal Santoshi.
>> 
>> 
>> PS: I had wrongly interpreted the code in my last 2 emails. Please ignore them.
>> 
>> 
>> 
>> On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning <te...@gmail.com>wrote:
>> 
>>> :-)
>>> 
>>> Many leaks are *very* subtle.
>>> 
>>> One leak that had me going for weeks was in a news wire corpus.  I
>>> couldn't
>>> figure out why the cross validation was so good and running the classifier
>>> on new data was soooo much worse.
>>> 
>>> The answer was that the training corpus had near-duplicate articles.  This
>>> means that there was leakage between the training and test corpora.  This
>>> wasn't quite a target leak, but it was a leak.
>>> 
>>> For target leaks, it is very common to have partial target leaks due to
>>> the
>>> fact that you learn more about positive cases after the moment that you
>>> had
>>> to select which case to investigate.  Suppose, for instance, you are
>>> targeting potential customers based on very limited information.  If you
>>> make an enticing offer to the people you target, then those who accept the
>>> offer will buy something from you.  You will also learn some particulars
>>> such as name and address from those who buy from you.
>>> 
>>> Looking retrospectively, it looks like you can target good customers who
>>> have names or addresses that are not null.  Without a good snapshot of
>>> each
>>> customer record at exactly the time that the targeting was done, you
>>> cannot
>>> know that *all* customers have a null name and address before you target
>>> them.  This sort of time machine leak can be enormously more subtle than
>>> this.
>>> 
>>> 
>>> 
>>>> On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com> wrote:
>>>> 
>>>> Gokhan
>>>> 
>>>> 
>>>> On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>> 
>>>>> On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com>
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Are we to assume that SGD is still a work in progress and
>>>>> implementations (
>>>>>> Cross Fold, Online, Adaptive ) are too flawed to be realistically
>>> used
>>>> ?
>>>>>> 
>>>>> 
>>>>> They are too raw to be accepted uncritically, for sure.  They have
>>> been
>>>>> used successfully in production.
>>>>> 
>>>>> 
>>>>>> The evolutionary algorithm seems to be the core of
>>>>>> OnlineLogisticRegression,
>>>>>> which in turn builds up to Adaptive/Cross Fold.
>>>>>> 
>>>>>>>> b) for truly on-line learning where no repeated passes through the
>>>>> data..
>>>>>> 
>>>>>> What would it take to get to an implementation? How can anyone
>>> help?
>>>>>> 
>>>>> 
>>>>> Would you like to help on this?  The amount of work required to get a
>>>>> distributed asynchronous learner up is moderate, but definitely not
>>> huge.
>>>>> 
>>>> 
>>>> Ted, are you describing a generic distributed learner for all kinds of
>>> online
>>>> algorithms? Possibly zookeeper-coordinated and with #predict and
>>>> #getFeedbackAndUpdateTheModel methods?
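
One way to read that question is as a small per-worker contract that a coordinator (ZooKeeper-based or otherwise) would drive; the interface below is only a sketch of that idea, not an existing Mahout API:

import org.apache.mahout.math.Vector;

/** Hypothetical contract for a coordinated online learner; names are illustrative. */
public interface DistributedOnlineLearner {

  /** Score an instance against the current local copy of the model. */
  double predict(Vector instance);

  /** Apply the observed label for an instance and update the local model. */
  void getFeedbackAndUpdateTheModel(int actual, Vector instance);

  /** Export local state so a coordinator can merge or broadcast models. */
  byte[] snapshotModel();
}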
>>>> 
>>>>> 
>>>>> I think that OnlineLogisticRegression is basically sound, but should
>>> get
>>>> a
>>>>> better learning rate update equation.  That would largely make the
>>>>> Adaptive* stuff unnecessary, especially if OLR could be used in the
>>>>> distributed asynchronous learner.
>>>>> 
>>>> 
>>> 
>> 
>> 

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
Yes.  I think that maintaining a learning rate for every parameter that is
being learned is important.  Making that storage sparse might help, but I
wouldn't think so.




On Sun, Mar 2, 2014 at 1:33 PM, Vishal Santoshi
<vi...@gmail.com>wrote:

> Should we maintain a (num_categories x num_of_features) matrix of per-term
> learning rates in a num_categories-way classification?
>
> for (i = 0; i < num_categories; i++) {
>   for (j = 0; j < num_of_features; j++) {
>     sum_of_squares[i][j] = sum_of_squares[i][j] + (beta[i][j] * beta[i][j]);
>     learning_rates[i][j] = (initial_rate / Math.sqrt(sum_of_squares[i][j])) * beta[i][j];
>   }
> }
>
> beta in the base class is rightly a (num_categories - 1) x num_of_features
> matrix.

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
Should we maintain a (num_categories x num_of_features) matrix of per-term
learning rates in a num_categories-way classification?

for (i = 0; i < num_categories; i++) {
  for (j = 0; j < num_of_features; j++) {
    sum_of_squares[i][j] = sum_of_squares[i][j] + (beta[i][j] * beta[i][j]);
    learning_rates[i][j] = (initial_rate / Math.sqrt(sum_of_squares[i][j])) * beta[i][j];
  }
}

beta in the base class is rightly a (num_categories - 1) x num_of_features
matrix.
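
A compilable sketch of that bookkeeping, sized to match beta at (num_categories - 1) x num_of_features, is below. It accumulates the squared gradient per entry, which is what the cited AdaGrad paper prescribes, rather than the squared beta used in the loop above; the method and variable names are assumptions, not Mahout code:

/**
 * Updates the per-entry squared-gradient accumulators in place and returns the
 * matching matrix of AdaGrad learning rates. Shapes follow beta:
 * (numCategories - 1) rows by numFeatures columns.
 */
static double[][] adagradRates(double[][] sumOfSquaredGradients,
                               double[][] gradient,
                               double initialRate) {
  int rows = gradient.length;
  int cols = gradient[0].length;
  double[][] rates = new double[rows][cols];
  for (int i = 0; i < rows; i++) {
    for (int j = 0; j < cols; j++) {
      sumOfSquaredGradients[i][j] += gradient[i][j] * gradient[i][j];
      rates[i][j] = initialRate / Math.sqrt(sumOfSquaredGradients[i][j] + 1.0e-10);
    }
  }
  return rates;
}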

On Fri, Feb 28, 2014 at 11:57 PM, Ted Dunning <te...@gmail.com> wrote:

> I have been swamped.  Generally, AdaGrad is a great idea.  The code looks
> fine at first glance.  Certainly some sort of AdaGrad would be preferable
> to the hack that I put in.