Posted to user@mahout.apache.org by optimusfan <op...@yahoo.com> on 2013/11/26 21:54:06 UTC

Detecting high bias and variance in AdaptiveLogisticRegression classification

Hi-

We're currently working on a binary classifier using Mahout's AdaptiveLogisticRegression class.  We're trying to determine whether or not the models are suffering from high bias or high variance, and we were wondering how to do this using Mahout's APIs.  I can easily calculate the cross-validation error, and I think I could detect high bias or variance if I could compare that number to my training error, but I'm not sure how to do this.  Any other ideas would also be appreciated!
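
For concreteness, here's the kind of comparison I have in mind, using Mahout's OnlineLogisticRegression and Auc classes (just an untested sketch; Example, trainingSet, heldOutSet, and numFeatures are placeholders for our own code):

import org.apache.mahout.classifier.evaluation.Auc;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;

// train on the 80% split
OnlineLogisticRegression olr =
    new OnlineLogisticRegression(2, numFeatures, new L1());
for (Example ex : trainingSet) {      // Example = (int label, Vector vector), our own holder
  olr.train(ex.label, ex.vector);
}

// score both splits; classifyScalar returns p(category 1) for a binary model
Auc trainAuc = new Auc();
for (Example ex : trainingSet) {
  trainAuc.add(ex.label, olr.classifyScalar(ex.vector));
}
Auc heldOutAuc = new Auc();
for (Example ex : heldOutSet) {
  heldOutAuc.add(ex.label, olr.classifyScalar(ex.vector));
}

// a high training AUC with a much lower held-out AUC suggests variance (over-fitting);
// both being low suggests bias (under-fitting)
System.out.printf("train AUC = %.3f, held-out AUC = %.3f%n",
    trainAuc.auc(), heldOutAuc.auc());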

Thanks,
Ian

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
No problem at all.  Kind of funny.



On Wed, Nov 27, 2013 at 7:08 AM, Vishal Santoshi
<vi...@gmail.com>wrote:

> Sorry to spam, I never meant the "Hello" to come out as "Hell". Given a
> little disappointment in the mail, I figure I'd rather spam than be
> misunderstood.
>
>
>
> On Wed, Nov 27, 2013 at 10:07 AM, Vishal Santoshi <
> vishal.santoshi@gmail.com
> > wrote:
>
> > Hell Ted,
> >
> > Are we to assume that SGD is still a work in progress and implementations
> > ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used
> ?
> > The evolutionary algorithm seems to be the core of
> OnlineLogisticRegression,
> > which in turn builds up to Adaptive/Cross Fold.
> >
> > >>b) for truly on-line learning where no repeated passes through the
> > data..
> >
> > What would it take to get to an implementation ? How can any one help ?
> >
> > Regards,
> >
> >
> >
> >
> >
> > On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning <ted.dunning@gmail.com
> >wrote:
> >
> >> Well, first off, let me say that I am much less of a fan now of the
> >> magical
> >> cross validation approach and adaptation based on that than I was when I
> >> wrote the ALR code.  There are definitely legs in the ideas, but my
> >> implementation has a number of flaws.
> >>
> >> For example:
> >>
> >> a) the way that I provide for handling multiple passes through the data
> is
> >> very easy to screw up.  I think that simply separating the data entirely
> >> might be a better approach.
> >>
> >> b) for truly on-line learning where no repeated passes through the data
> >> will ever occur, then cross validation is not the best choice.  Much
> >> better
> >> in those cases to use what Google researchers described in [1].
> >>
> >> c) it is clear from several reports that the evolutionary algorithm
> >> prematurely shuts down the learning rate.  I think that Adagrad-like
> >> learning rates are more reliable.  See [1] again for one of the more
> >> readable descriptions of this.  See also [2] for another view on
> adaptive
> >> learning rates.
> >>
> >> d) item (c) is also related to the way that learning rates are adapted
> in
> >> the underlying OnlineLogisticRegression.  That needs to be fixed.
> >>
> >> e) asynchronous parallel stochastic gradient descent with mini-batch
> >> learning is where we should be headed.  I do not have time to write it,
> >> however.
> >>
> >> All this aside, I am happy to help in any way that I can given my recent
> >> time limits.
> >>
> >>
> >> [1] http://research.google.com/pubs/pub41159.html
> >>
> >> [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf
> >>
> >>
> >>
> >> On Tue, Nov 26, 2013 at 12:54 PM, optimusfan <op...@yahoo.com>
> >> wrote:
> >>
> >> > Hi-
> >> >
> >> > We're currently working on a binary classifier using
> >> > Mahout's AdaptiveLogisticRegression class.  We're trying to determine
> >> > whether or not the models are suffering from high bias or variance and
> >> were
> >> > wondering how to do this using Mahout's APIs?  I can easily calculate
> >> the
> >> > cross validation error and I think I could detect high bias or
> variance
> >> if
> >> > I could compare that number to my training error, but I'm not sure how
> >> to
> >> > do this.  Or, any other ideas would be appreciated!
> >> >
> >> > Thanks,
> >> > Ian
> >>
> >
> >
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
Sorry to spam, I never meant the "Hello" to come out as "Hell". Given a
little disappointment in the mail, I figure I'd rather spam than be
misunderstood.



On Wed, Nov 27, 2013 at 10:07 AM, Vishal Santoshi <vishal.santoshi@gmail.com
> wrote:

> Hell Ted,
>
> Are we to assume that SGD is still a work in progress and implementations
> ( Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
> The evolutionary algorithm seems to be the core of OnlineLogisticRegression,
> which in turn builds up to Adaptive/Cross Fold.
>
> >>b) for truly on-line learning where no repeated passes through the
> data..
>
> What would it take to get to an implementation ? How can any one help ?
>
> Regards,
>
>
>
>
>
> On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning <te...@gmail.com>wrote:
>
>> Well, first off, let me say that I am much less of a fan now of the
>> magical
>> cross validation approach and adaptation based on that than I was when I
>> wrote the ALR code.  There are definitely legs in the ideas, but my
>> implementation has a number of flaws.
>>
>> For example:
>>
>> a) the way that I provide for handling multiple passes through the data is
>> very easy to screw up.  I think that simply separating the data entirely
>> might be a better approach.
>>
>> b) for truly on-line learning where no repeated passes through the data
>> will ever occur, then cross validation is not the best choice.  Much
>> better
>> in those cases to use what Google researchers described in [1].
>>
>> c) it is clear from several reports that the evolutionary algorithm
>> prematurely shuts down the learning rate.  I think that Adagrad-like
>> learning rates are more reliable.  See [1] again for one of the more
>> readable descriptions of this.  See also [2] for another view on adaptive
>> learning rates.
>>
>> d) item (c) is also related to the way that learning rates are adapted in
>> the underlying OnlineLogisticRegression.  That needs to be fixed.
>>
>> e) asynchronous parallel stochastic gradient descent with mini-batch
>> learning is where we should be headed.  I do not have time to write it,
>> however.
>>
>> All this aside, I am happy to help in any way that I can given my recent
>> time limits.
>>
>>
>> [1] http://research.google.com/pubs/pub41159.html
>>
>> [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf
>>
>>
>>
>> On Tue, Nov 26, 2013 at 12:54 PM, optimusfan <op...@yahoo.com>
>> wrote:
>>
>> > Hi-
>> >
>> > We're currently working on a binary classifier using
>> > Mahout's AdaptiveLogisticRegression class.  We're trying to determine
>> > whether or not the models are suffering from high bias or variance and
>> were
>> > wondering how to do this using Mahout's APIs?  I can easily calculate
>> the
>> > cross validation error and I think I could detect high bias or variance
>> if
>> > I could compare that number to my training error, but I'm not sure how
>> to
>> > do this.  Or, any other ideas would be appreciated!
>> >
>> > Thanks,
>> > Ian
>>
>
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by optimusfan <op...@yahoo.com>.



>> We've been playing around with a number of different parameters, feature
>> selection, etc. and are able to achieve pretty good results in
>> cross-validation.
>>
>>When you say cross validation, do you mean the magic cross validation that
>>the ALR uses?  Or do you mean your 20%?

I mean the 20%.  Does the ALR algorithm do its own cross validation?  I was under the impression that it did training and testing steps with a percentage split based on the number of something (CrossFoldLearners?) in the object.  Is that correct?  As I said, we've been holding back 20% to do our own cross validation.
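
If I'm reading the API right, ALR's internal estimate is also exposed, so something like this should report the AUC from the CrossFoldLearner it keeps (sketch only; Example, trainingSet, and numFeatures are placeholders for our own code):

AdaptiveLogisticRegression alr =
    new AdaptiveLogisticRegression(2, numFeatures, new L1());
for (Example ex : trainingSet) {
  alr.train(ex.label, ex.vector);
}
alr.close();   // drain any pending training in the candidate learners
double internalAuc = alr.getBest().getPayload().getLearner().auc();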

>>  We have a ton of different metrics we're tracking on the results, most
>> significant to this discussion is that it looks like we're achieving very
>> good precision (typically >.85 or .9) and a good f1-score (typically again
>> >.85 or .9).
>>
>>These are extremely good results.   In fact, they are good enough that I would
>>start thinking about a target leak.

The possibility of a target leak is interesting as it hadn't occurred to me previously.  However, thinking it through I'm less inclined to think it's a possibility.  We wrote a simple program to extract the model features and weights and I would think a leak would be obvious there, yes?  The terms we're seeing seem to make sense.

>>However, when we then take the models generated and try to apply them to
>> some new documents, we're getting many more false positives than we would
>> expect.  Documents that should have 2 categories are testing positive for
>> 16, which is well above what I'd expect.  By my math I should expect 2 true
>> positives, plus maybe 4.4 (.10 false positives * 44 classes) additional
>> false positives.
>>
>>
>>You said documents.  Where do these documents come from?

Sorry, to clarify: all of our inputs are documents.  Specifically, they're technical (scientific) papers written by people at our company.  The documents are indexed in SOLR, and we use Mahout's lucene.vector tool to extract our data.  We started our development of this process a couple of months ago and took an extract from SOLR at that time.  The new documents we're trying to classify after settling on a model are those that have come into SOLR after that extraction took place.

>>One way to get results just like you describe is if you train on raw news
>>wire that is split randomly between training and test.  What can happen is
>>that stories that get edited and republished have a high chance of getting
>>at least one version in both training and test.  This means that the
>>supposedly independent test set actually has significant overlap with the
>>training set.  If your classifier over-fits, then the test set doesn't
>>catch the problem.

I don't believe this is happening, but it is worth checking into.  

>>Another way to get this sort of problem is if you do your training/test
>>randomly, but the new documents come from a later time.  If your classifier
>>is a good classifier, but is highly specific to documents from a particular
>>moment in time, then your test performance will be a realistic estimate of
>>performance for contemporaneous documents but will be much higher than
>>performance on documents from a later point in time.

The temporal aspect is an interesting one.  I will have to check on that.

>>A third option could happen if your training and test sets were somehow
>>scrubbed of poorly structured and invalid documents.  This often happens.
>>Then, in the real system, if the scrubbing is not done, the classifier may
>>fail because the new documents are not scrubbed in the same way as the
>>training documents.

I think we've handled this.  I'm processing "new" documents programmatically through an analysis chain that I believe accurately mimics the one that I indexed against in SOLR.  The results were complete garbage before I made them match exactly.  In addition, wouldn't I expect more false negatives than false positives if that was the case?

>>Well, I think that, almost by definition, you have an overfitting problem
>>of some kind.  The question is what kind.  The only thing that I think that
>>you don't have is a frank target leak in your documents.  That would
>>(probably) have given you even higher scores on your test case.

Is there any easy way to detect an overfit?  We've noticed at least one interesting thing that seems to be typical of the bad models.  For each class a percentage "confidence" score is reported.  With our binary models, obviously, the choices are 0 or 1.  The bad models tend to be very certain in their answers -- e.g. they're either >99% certain a document is or isn't a particular class.  Is that indicative of overfitting, or completely unrelated?
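
One check I'm thinking of adding (just a sketch) is a crude calibration histogram over our held-out 20%: bucket the predicted probabilities and compare each bucket's mean prediction to the observed positive rate.  A reasonably fit model's buckets should roughly agree; an over-fit one tends to pile everything up against 0 and 1:

int buckets = 10;
double[] sumPredicted = new double[buckets];
int[] positives = new int[buckets];
int[] counts = new int[buckets];
for (Example ex : heldOutSet) {               // Example is a (label, vector) holder from our own code
  double p = olr.classifyScalar(ex.vector);   // predicted probability of the positive class
  int b = Math.min(buckets - 1, (int) (p * buckets));
  sumPredicted[b] += p;
  positives[b] += ex.label;                   // label is 0 or 1
  counts[b]++;
}
for (int b = 0; b < buckets; b++) {
  if (counts[b] > 0) {
    System.out.printf("bucket %d: mean predicted = %.2f, observed rate = %.2f (n=%d)%n",
        b, sumPredicted[b] / counts[b], (double) positives[b] / counts[b], counts[b]);
  }
}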

THANKS!
Ian

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
Inline


On Mon, Dec 2, 2013 at 8:55 AM, optimusfan <op...@yahoo.com> wrote:

> ... To accomplish this, we used AdaptiveLogisticRegression and trained 46
> binary classification models.  Our approach has been to do an 80/20 split
> on the data, holding the 20% back for cross-validation of the models we
> generate.
>

Sounds reasonable.


> We've been playing around with a number of different parameters, feature
> selection, etc. and are able to achieve pretty good results in
> cross-validation.


When you say cross validation, do you mean the magic cross validation that
the ALR uses?  Or do you mean your 20%?


>  We have a ton of different metrics we're tracking on the results, most
> significant to this discussion is that it looks like we're achieving very
> good precision (typically >.85 or .9) and a good f1-score (typically again
> >.85 or .9).


These are extremely good results.   In fact, they are good enough that I would
start thinking about a target leak.

>  However, when we then take the models generated and try to apply them to
> some new documents, we're getting many more false positives than we would
> expect.  Documents that should have 2 categories are testing positive for
> 16, which is well above what I'd expect.  By my math I should expect 2 true
> positives, plus maybe 4.4 (.10 false positives * 44 classes) additional
> false positives.
>

You said documents.  Where do these documents come from?

One way to get results just like you describe is if you train on raw news
wire that is split randomly between training and test.  What can happen is
that stories that get edited and republished have a high chance of getting
at least one version in both training and test.  This means that the
supposedly independent test set actually has significant overlap with the
training set.  If your classifier over-fits, then the test set doesn't
catch the problem.

Another way to get this sort of problem is if you do your training/test
randomly, but the new documents come from a later time.  If your classifier
is a good classifier, but is highly specific to documents from a particular
moment in time, then your test performance will be a realistic estimate of
performance for contemporaneous documents but will be much higher than
performance on documents from a later point in time.
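
A cheap guard against that failure is to split by time instead of randomly, so that the test set simulates genuinely later documents.  A sketch, assuming each document carries a timestamp (Doc is a made-up holder):

List<Doc> docs = new ArrayList<>(corpus);
Collections.sort(docs, new Comparator<Doc>() {
  @Override
  public int compare(Doc a, Doc b) {
    return Long.compare(a.timestamp, b.timestamp);   // oldest first
  }
});
int cut = (int) (docs.size() * 0.8);
List<Doc> train = docs.subList(0, cut);              // older 80% for training
List<Doc> test = docs.subList(cut, docs.size());     // newest 20% stands in for future documents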

A third option could happen if your training and test sets were somehow
scrubbed of poorly structured and invalid documents.  This often happens.
 Then, in the real system, if the scrubbing is not done, the classifier may
fail because the new documents are not scrubbed in the same way as the
training documents.

These are just a few of the ways that *I* have screwed up building
classifiers.  I am sure that there are more.

> We suspected that perhaps our models were underfitting or overfitting,
> hence this post.  However, I'll take any and all suggestions for anything
> else we should be looking at.
>

Well, I think that, almost by definition, you have an overfitting problem
of some kind.  The question is what kind.  The only thing that I think that
you don't have is a frank target leak in your documents.  That would
(probably) have given you even higher scores on your test case.

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by optimusfan <op...@yahoo.com>.
Ted-

Thanks for the response.  Just getting back after the holiday weekend and am catching up on this.  Let me be more specific about what we're doing and what we're seeing in terms of results.  Our goal was to create a classifier that could assign one or more of 46 categories to various documents that it sees.  To accomplish this, we used AdaptiveLogisticRegression and trained 46 binary classification models.  Our approach has been to do an 80/20 split on the data, holding the 20% back for cross-validation of the models we generate.

We've been playing around with a number of different parameters, feature selection, etc. and are able to achieve pretty good results in cross-validation.  We have a ton of different metrics we're tracking on the results, most significant to this discussion is that it looks like we're achieving very good precision (typically >.85 or .9) and a good f1-score (typically again >.85 or .9).  However, when we then take the models generated and try to apply them to some new documents, we're getting many more false positives than we would expect.  Documents that should have 2 categories are testing positive for 16, which is well above what I'd expect.  By my math I should expect 2 true positives, plus maybe 4.4 (.10 false positives * 44 classes) additional false positives.

We suspected that perhaps our models were underfitting or overfitting, hence this post.  However, I'll take any and all suggestions for anything else we should be looking at.

Thanks,
Ian



On Thursday, November 28, 2013 2:20 PM, Ted Dunning <te...@gmail.com> wrote:
 
Yes.  Exactly.



On Thu, Nov 28, 2013 at 6:32 AM, Vishal Santoshi
<vi...@gmail.com>wrote:

> Absolutely. I will read through.  The idea is to first  fix the learning
> rate update equation in OLR.
> I think this code  in  OnlineLogisticRegression is the current equation ?
>
> @Override
>
>   public double currentLearningRate() {
>
>     return mu0 * Math.pow(decayFactor, getStep()) * Math.pow(getStep() +
> stepOffset, forgettingExponent);
>
>   }
>
>
> I presume that you would like an Adagrad-like solution to replace the above?
>
>
>
>
>
>
> On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> > vishal.santoshi@gmail.com>
> >
> > >
> > >
> > > Are we to assume that SGD is still a work in progress and
> > implementations (
> > > Cross Fold, Online, Adaptive ) are too flawed to be realistically used
> ?
> > >
> >
> > They are too raw to be accepted uncritically, for sure.  They have been
> > used successfully in production.
> >
> >
> > > The evolutionary algorithm seems to be the core of
> > > OnlineLogisticRegression,
> > > which in turn builds up to Adaptive/Cross Fold.
> > >
> > > >>b) for truly on-line learning where no repeated passes through the
> > data..
> > >
> > > What would it take to get to an implementation ? How can any one help ?
> > >
> >
> > Would you like to help on this?  The amount of work required to get a
> > distributed asynchronous learner up is moderate, but definitely not huge.
> >
> > I think that OnlineLogisticRegression is basically sound, but should get
> a
> > better learning rate update equation.  That would largely make the
> > Adaptive* stuff unnecessary, especially if OLR could be used in the
> > distributed asynchronous learner.
> >
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
Yes.  Exactly.


On Thu, Nov 28, 2013 at 6:32 AM, Vishal Santoshi
<vi...@gmail.com>wrote:

> Absolutely. I will read through.  The idea is to first  fix the learning
> rate update equation in OLR.
> I think this code  in  OnlineLogisticRegression is the current equation ?
>
> @Override
>
>   public double currentLearningRate() {
>
>     return mu0 * Math.pow(decayFactor, getStep()) * Math.pow(getStep() +
> stepOffset, forgettingExponent);
>
>   }
>
>
> I presume that you would like an Adagrad-like solution to replace the above?
>
>
>
>
>
>
> On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> > vishal.santoshi@gmail.com>
> >
> > >
> > >
> > > Are we to assume that SGD is still a work in progress and
> > implementations (
> > > Cross Fold, Online, Adaptive ) are too flawed to be realistically used
> ?
> > >
> >
> > They are too raw to be accepted uncritically, for sure.  They have been
> > used successfully in production.
> >
> >
> > > The evolutionary algorithm seems to be the core of
> > > OnlineLogisticRegression,
> > > which in turn builds up to Adaptive/Cross Fold.
> > >
> > > >>b) for truly on-line learning where no repeated passes through the
> > data..
> > >
> > > What would it take to get to an implementation ? How can any one help ?
> > >
> >
> > Would you like to help on this?  The amount of work required to get a
> > distributed asynchronous learner up is moderate, but definitely not huge.
> >
> > I think that OnlineLogisticRegression is basically sound, but should get
> a
> > better learning rate update equation.  That would largely make the
> > Adaptive* stuff unnecessary, especially if OLR could be used in the
> > distributed asynchronous learner.
> >
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
Absolutely. I will read through.  The idea is to first fix the learning
rate update equation in OLR.
Is this code in OnlineLogisticRegression the current equation?

@Override

  public double currentLearningRate() {

    return mu0 * Math.pow(decayFactor, getStep()) * Math.pow(getStep() +
stepOffset, forgettingExponent);

  }


I presume that you would like an Adagrad-like solution to replace the above?
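
Roughly what I imagine the replacement looking like, as a pure sketch (field names invented; OnlineLogisticRegression has no such state today):

// one accumulator per feature; it would have to be initialized and updated in learn(...)
private Vector sumSquaredGradients;
private static final double EPSILON = 1.0e-10;   // guards the first example against division by zero

// Adagrad-style rate: a constant base rate shrunk by the gradient mass seen so far
public double currentLearningRate(int j) {
  return mu0 / Math.sqrt(EPSILON + sumSquaredGradients.get(j));
}

// inside learn(...), after computing the gradient g for term j:
//   sumSquaredGradients.setQuick(j, sumSquaredGradients.getQuick(j) + g * g);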






On Wed, Nov 27, 2013 at 8:18 PM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> vishal.santoshi@gmail.com>
>
> >
> >
> > Are we to assume that SGD is still a work in progress and
> implementations (
> > Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
> >
>
> They are too raw to be accepted uncritically, for sure.  They have been
> used successfully in production.
>
>
> > The evolutionary algorithm seems to be the core of
> > OnlineLogisticRegression,
> > which in turn builds up to Adaptive/Cross Fold.
> >
> > >>b) for truly on-line learning where no repeated passes through the
> data..
> >
> > What would it take to get to an implementation ? How can any one help ?
> >
>
> Would you like to help on this?  The amount of work required to get a
> distributed asynchronous learner up is moderate, but definitely not huge.
>
> I think that OnlineLogisticRegression is basically sound, but should get a
> better learning rate update equation.  That would largely make the
> Adaptive* stuff unnecessary, especially if OLR could be used in the
> distributed asynchronous learner.
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
I do see the regularizer has the prior (L1 and L2) depend on
perTermLearningRate(j) ...


On Thu, Feb 20, 2014 at 11:49 AM, Vishal Santoshi <vishal.santoshi@gmail.com
> wrote:

> Hey Ted,
>
> >> I presume that you would like  Adagrad-like solution to replace the
> above ?
>
> Things that I could glean out.
>
>
>
>
>  *  Maintain a simple d-dimensional vector to store a running
> total of the squares of the gradients, where d is the number of terms.  Say
> *gradients*.
>
>
>
>
> *  Based on
>
>      "Since the learning rate for each feature is quickly adapted, the
> value for is far less important than it is with SGD. I have used = 1:0 for
> a very large number of different problems. The primary role of
>      is to determine how much a feature changes the very first time it is
> encountered, so in problems with large numbers of extremely rare features,
> some additional care may be warranted."
>
>      *How important or even necessary is  perTermLearningRate(j)  ?*
>
>
>
>
> *  double newValue = beta.getQuick(i, j) + gradientBase * learningRate *
> perTermLearningRate(j) * instance.get(j);
>
>    becomes
>
>     double newGradient = beta.getQuick(i, j) + (learningRate / Math.sqrt(
> gradients(i))) * instance.get(j);
>
>     gradients(i) = gradients(i) + newGradient ^ 2;
>
>
>
>
>
> Does this make sense ? The only thing is that the abstract class changes.
>
>
> Regards.
>
>
>
>
> On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> :-)
>>
>> Many leaks are *very* subtle.
>>
>> One leak that had me going for weeks was in a news wire corpus.  I
>> couldn't
>> figure out why the cross validation was so good and running the classifier
>> on new data was soooo much worse.
>>
>> The answer was that the training corpus had near-duplicate articles.  This
>> means that there was leakage between the training and test corpora.  This
>> wasn't quite a target leak, but it was a leak.
>>
>> For target leaks, it is very common to have partial target leaks due to
>> the
>> fact that you learn more about positive cases after the moment that you
>> had
>> to select which case to investigate.  Suppose, for instance you are
>> targeting potential customers based on very limited information.  If you
>> make an enticing offer to the people you target, then those who accept the
>> offer will buy something from you.  You will also learn some particulars
>> such as name and address from those who buy from you.
>>
>> Looking retrospectively, it looks like you can target good customers who
>> have names or addresses that are not null.  Without a good snapshot of
>> each
>> customer record at exactly the time that the targeting was done, you
>> cannot
>> know that *all* customers have a null name and address before you target
>> them.  This sort of time machine leak can be enormously more subtle than
>> this.
>>
>>
>>
>> On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com> wrote:
>>
>> > Gokhan
>> >
>> >
>> > On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> >
>> > > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
>> > > vishal.santoshi@gmail.com>
>> > >
>> > > >
>> > > >
>> > > > Are we to assume that SGD is still a work in progress and
>> > > implementations (
>> > > > Cross Fold, Online, Adaptive ) are too flawed to be realistically
>> used
>> > ?
>> > > >
>> > >
>> > > They are too raw to be accepted uncritically, for sure.  They have
>> been
>> > > used successfully in production.
>> > >
>> > >
>> > > > The evolutionary algorithm seems to be the core of
>> > > > OnlineLogisticRegression,
>> > > > which in turn builds up to Adaptive/Cross Fold.
>> > > >
>> > > > >>b) for truly on-line learning where no repeated passes through the
>> > > data..
>> > > >
>> > > > What would it take to get to an implementation ? How can any one
>> help ?
>> > > >
>> > >
>> > > Would you like to help on this?  The amount of work required to get a
>> > > distributed asynchronous learner up is moderate, but definitely not
>> huge.
>> > >
>> >
>> > Ted, do you describe a generic distributed learner for all kinds of
>> online
>> > algorithms? Possibly zookeeper-coordinated and with #predict and
>> > #getFeedbackAndUpdateTheModel methods?
>> >
>> > >
>> > > I think that OnlineLogisticRegression is basically sound, but should
>> get
>> > a
>> > > better learning rate update equation.  That would largely make the
>> > > Adaptive* stuff unnecessary, especially if OLR could be used in the
>> > > distributed asynchronous learner.
>> > >
>> >
>>
>
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
Hey Ted,

>> I presume that you would like  Adagrad-like solution to replace the
above ?

Things that I could glean out.




 *  Maintain a simple d-dimensional vector to store a running
total of the squares of the gradients, where d is the number of terms.  Say
*gradients*.




*  Based on

     "Since the learning rate for each feature is quickly adapted, the
value for  is far less important than it is with SGD. I have used  = 1:0
for a very large number of different problems. The primary role of
     is to determine how much a feature changes the very first time it is
encountered, so in problems with large numbers of extremely rare features,
some additional care may be warranted."

     *How important or even necessary is  perTermLearningRate(j)  ?*




*  double newValue = beta.getQuick(i, j) + gradientBase * learningRate *
perTermLearningRate(j) * instance.get(j);

   becomes

    double newGradient = beta.getQuick(i, j) + (learningRate / Math.sqrt(
gradients(i))) * instance.get(j);

    gradients(i) = gradients(i) + newGradient ^ 2;





Does this make sense? The only thing is that the abstract class changes.
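
For reference, my reading of the textbook AdaGrad update is that the accumulator holds the squared *gradients*, indexed per term, i.e. something like this (EPSILON is invented, to avoid dividing by zero on the first example):

double g = gradientBase * instance.get(j);               // gradient of the loss for term j
gradients.setQuick(j, gradients.getQuick(j) + g * g);    // accumulate g^2, not the coefficient
double newValue = beta.getQuick(i, j)
    + (learningRate / Math.sqrt(EPSILON + gradients.getQuick(j))) * g;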


Regards.




On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning <te...@gmail.com> wrote:

> :-)
>
> Many leaks are *very* subtle.
>
> One leak that had me going for weeks was in a news wire corpus.  I couldn't
> figure out why the cross validation was so good and running the classifier
> on new data was soooo much worse.
>
> The answer was that the training corpus had near-duplicate articles.  This
> means that there was leakage between the training and test corpora.  This
> wasn't quite a target leak, but it was a leak.
>
> For target leaks, it is very common to have partial target leaks due to the
> fact that you learn more about positive cases after the moment that you had
> to select which case to investigate.  Suppose, for instance you are
> targeting potential customers based on very limited information.  If you
> make an enticing offer to the people you target, then those who accept the
> offer will buy something from you.  You will also learn some particulars
> such as name and address from those who buy from you.
>
> Looking retrospectively, it looks like you can target good customers who
> have names or addresses that are not null.  Without a good snapshot of each
> customer record at exactly the time that the targeting was done, you cannot
> know that *all* customers have a null name and address before you target
> them.  This sort of time machine leak can be enormously more subtle than
> this.
>
>
>
> On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com> wrote:
>
> > Gokhan
> >
> >
> > On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> > > vishal.santoshi@gmail.com>
> > >
> > > >
> > > >
> > > > Are we to assume that SGD is still a work in progress and
> > > implementations (
> > > > Cross Fold, Online, Adaptive ) are too flawed to be realistically
> used
> > ?
> > > >
> > >
> > > They are too raw to be accepted uncritically, for sure.  They have been
> > > used successfully in production.
> > >
> > >
> > > > The evolutionary algorithm seems to be the core of
> > > > OnlineLogisticRegression,
> > > > which in turn builds up to Adaptive/Cross Fold.
> > > >
> > > > >>b) for truly on-line learning where no repeated passes through the
> > > data..
> > > >
> > > > What would it take to get to an implementation ? How can any one
> help ?
> > > >
> > >
> > > Would you like to help on this?  The amount of work required to get a
> > > distributed asynchronous learner up is moderate, but definitely not
> huge.
> > >
> >
> > Ted, do you describe a generic distributed learner for all kinds of
> online
> > algorithms? Possibly zookeeper-coordinated and with #predict and
> > #getFeedbackAndUpdateTheModel methods?
> >
> > >
> > > I think that OnlineLogisticRegression is basically sound, but should
> get
> > a
> > > better learning rate update equation.  That would largely make the
> > > > Adaptive* stuff unnecessary, especially if OLR could be used in the
> > > distributed asynchronous learner.
> > >
> >
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
Yes.  I think that maintaining a learning rate for every parameter that is
being learned is important.  It might help to make that sparse, but I
wouldn't think so.
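
Concretely, that means one accumulator per coefficient, the same shape as beta (a sketch; the names are invented):

import org.apache.mahout.math.DenseMatrix;
import org.apache.mahout.math.Matrix;

// one squared-gradient accumulator for every coefficient being learned
Matrix sumSquaredGradients = new DenseMatrix(numCategories - 1, numFeatures);

// per-parameter Adagrad-style rate, given the accumulated squared gradient
double rate(int i, int j) {
  return mu0 / Math.sqrt(EPSILON + sumSquaredGradients.get(i, j));
}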




On Sun, Mar 2, 2014 at 1:33 PM, Vishal Santoshi
<vi...@gmail.com>wrote:

> Should we maintain a (num_categories * num_features) matrix for
> per-term learning rates in a num_categories-way classification?
>
>
> for (i = 0; i < num_categories; i++) {
>   for (j = 0; j < num_features; j++) {
>     sum_of_squares[i][j] = sum_of_squares[i][j]
>         + (beta[i][j] * beta[i][j]);
>     learning_rates[i][j] =
>         (initial_rate / Math.sqrt(sum_of_squares[i][j]))
>         * beta[i][j];
>   }
> }
>
> *beta* in the base class is rightly a (num_categories - 1) * num_features
> matrix.
>
> On Fri, Feb 28, 2014 at 11:57 PM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > I have been swamped.  Generally adagrad is a great idea. The code
> looks
> > fine at first glance.  Certainly some sort of adagrad would be preferable
> > to the hack that I put in.
> >
> > Sent from my iPhone
> >
> > > On Feb 26, 2014, at 18:30, Vishal Santoshi <vi...@gmail.com>
> > wrote:
> > >
> > > Ted,  Any feedback ?
> > >
> > >
> > > On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
> > > <vi...@gmail.com>wrote:
> > >
> > >> Hello Ted,
> > >>
> > >>                  This is regarding the AdaGrad update per feature. I have
> > >> attached a file which reflects
> > >> http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf  ( 2 )
> > >>
> > >>
> > >>
> > >> It does differ from OnlineLogisticRegression in the way it implements
> > >>
> > >> public double perTermLearningRate(int j) ;
> > >>
> > >>
> > >> This class maintains 2 Dense Vectors
> > >>
> > >> /**
> > >>
> > >> * ADA  Per Term Sum of Squares of Learning gradients
> > >>
> > >> */
> > >>
> > >> protected Vector perTermLSumOfSquaresOfGradients;
> > >>
> > >> /**
> > >>
> > >> * ADA Per Term Learning gradient
> > >>
> > >> */
> > >>
> > >> protected Vector perTermGradients;
> > >>
> > >> and it overrides the learn(.... ) method to  update these two vectors
> > >> respectively.
> > >>
> > >>
> > >>
> > >>
> > >> Please tell me if I am totally off here.
> > >>
> > >>
> > >>
> > >> Thank you for your help and Regards.
> > >>
> > >>
> > >> Vishal Santoshi.
> > >>
> > >>
> > >> PS: I had wrongly interpreted the code in the last 2 emails. Please ignore.
> > >>
> > >>
> > >>
> > >> On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning <ted.dunning@gmail.com
> > >wrote:
> > >>
> > >>> :-)
> > >>>
> > >>> Many leaks are *very* subtle.
> > >>>
> > >>> One leak that had me going for weeks was in a news wire corpus.  I
> > >>> couldn't
> > >>> figure out why the cross validation was so good and running the
> > classifier
> > >>> on new data was soooo much worse.
> > >>>
> > >>> The answer was that the training corpus had near-duplicate articles.
> >  This
> > >>> means that there was leakage between the training and test corpora.
> >  This
> > >>> wasn't quite a target leak, but it was a leak.
> > >>>
> > >>> For target leaks, it is very common to have partial target leaks due
> to
> > >>> the
> > >>> fact that you learn more about positive cases after the moment that
> you
> > >>> had
> > >>> to select which case to investigate.  Suppose, for instance you are
> > >>> targeting potential customers based on very limited information.  If
> > you
> > >>> make an enticing offer to the people you target, then those who
> accept
> > the
> > >>> offer will buy something from you.  You will also learn some
> > particulars
> > >>> such as name and address from those who buy from you.
> > >>>
> > >>> Looking retrospectively, it looks like you can target good customers
> > who
> > >>> have names or addresses that are not null.  Without a good snapshot
> of
> > >>> each
> > >>> customer record at exactly the time that the targeting was done, you
> > >>> cannot
> > >>> know that *all* customers have a null name and address before you
> > target
> > >>> them.  This sort of time machine leak can be enormously more subtle
> > than
> > >>> this.
> > >>>
> > >>>
> > >>>
> > >>>> On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com>
> > wrote:
> > >>>>
> > >>>> Gokhan
> > >>>>
> > >>>>
> > >>>> On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <ted.dunning@gmail.com
> >
> > >>>> wrote:
> > >>>>
> > >>>>> On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> > >>>>> vishal.santoshi@gmail.com>
> > >>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>> Are we to assume that SGD is still a work in progress and
> > >>>>> implementations (
> > >>>>>> Cross Fold, Online, Adaptive ) are too flawed to be realistically
> > >>> used
> > >>>> ?
> > >>>>>>
> > >>>>>
> > >>>>> They are too raw to be accepted uncritically, for sure.  They have
> > >>> been
> > >>>>> used successfully in production.
> > >>>>>
> > >>>>>
> > >>>>>> The evolutionary algorithm seems to be the core of
> > >>>>>> OnlineLogisticRegression,
> > >>>>>> which in turn builds up to Adaptive/Cross Fold.
> > >>>>>>
> > >>>>>>>> b) for truly on-line learning where no repeated passes through
> the
> > >>>>> data..
> > >>>>>>
> > >>>>>> What would it take to get to an implementation ? How can any one
> > >>> help ?
> > >>>>>>
> > >>>>>
> > >>>>> Would you like to help on this?  The amount of work required to
> get a
> > >>>>> distributed asynchronous learner up is moderate, but definitely not
> > >>> huge.
> > >>>>>
> > >>>>
> > >>>> Ted, do you describe a generic distributed learner for all kinds of
> > >>> online
> > >>>> algorithms? Possibly zookeeper-coordinated and with #predict and
> > >>>> #getFeedbackAndUpdateTheModel methods?
> > >>>>
> > >>>>>
> > >>>>> I think that OnlineLogisticRegression is basically sound, but
> should
> > >>> get
> > >>>> a
> > >>>>> better learning rate update equation.  That would largely make the
> > >>>>> Adaptive* stuff unnecessary, especially if OLR could be used in the
> > >>>>> distributed asynchronous learner.
> > >>>>>
> > >>>>
> > >>>
> > >>
> > >>
> >
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
Should we maintain a (num_categories * num_features) matrix for per-term
learning rates in a num_categories-way classification?

for (i = 0; i < num_categories; i++) {
  for (j = 0; j < num_features; j++) {
    sum_of_squares[i][j] = sum_of_squares[i][j]
        + (beta[i][j] * beta[i][j]);
    learning_rates[i][j] =
        (initial_rate / Math.sqrt(sum_of_squares[i][j]))
        * beta[i][j];
  }
}

*beta* in the base class is rightly a (num_categories - 1) * num_features
matrix.

On Fri, Feb 28, 2014 at 11:57 PM, Ted Dunning <te...@gmail.com> wrote:

> I have been swamped.  Generally adagrad is a great idea. The code looks
> fine at first glance.  Certainly some sort of adagrad would be preferable
> to the hack that I put in.
>
> Sent from my iPhone
>
> > On Feb 26, 2014, at 18:30, Vishal Santoshi <vi...@gmail.com>
> wrote:
> >
> > Ted,  Any feedback ?
> >
> >
> > On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
> > <vi...@gmail.com>wrote:
> >
> >> Hello Ted,
> >>
> >>                  This is regarding the AdaGrad update per feature. I have
> >> attached a file which reflects
> >> http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf  ( 2 )
> >>
> >>
> >>
> >> It does differ from OnlineLogisticRegression in the way it implements
> >>
> >> public double perTermLearningRate(int j) ;
> >>
> >>
> >> This class maintains 2 Dense Vectors
> >>
> >> /**
> >>
> >> * ADA  Per Term Sum of Squares of Learning gradients
> >>
> >> */
> >>
> >> protected Vector perTermLSumOfSquaresOfGradients;
> >>
> >> /**
> >>
> >> * ADA Per Term Learning gradient
> >>
> >> */
> >>
> >> protected Vector perTermGradients;
> >>
> >> and it overrides the learn(.... ) method to  update these two vectors
> >> respectively.
> >>
> >>
> >>
> >>
> >> Please tell me if I am totally off here.
> >>
> >>
> >>
> >> Thank you for your help and Regards.
> >>
> >>
> >> Vishal Santoshi.
> >>
> >>
> >> PS: I had wrongly interpreted the code in the last 2 emails. Please ignore.
> >>
> >>
> >>
> >> On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning <ted.dunning@gmail.com
> >wrote:
> >>
> >>> :-)
> >>>
> >>> Many leaks are *very* subtle.
> >>>
> >>> One leak that had me going for weeks was in a news wire corpus.  I
> >>> couldn't
> >>> figure out why the cross validation was so good and running the
> classifier
> >>> on new data was soooo much worse.
> >>>
> >>> The answer was that the training corpus had near-duplicate articles.
>  This
> >>> means that there was leakage between the training and test corpora.
>  This
> >>> wasn't quite a target leak, but it was a leak.
> >>>
> >>> For target leaks, it is very common to have partial target leaks due to
> >>> the
> >>> fact that you learn more about positive cases after the moment that you
> >>> had
> >>> to select which case to investigate.  Suppose, for instance you are
> >>> targeting potential customers based on very limited information.  If
> you
> >>> make an enticing offer to the people you target, then those who accept
> the
> >>> offer will buy something from you.  You will also learn some
> particulars
> >>> such as name and address from those who buy from you.
> >>>
> >>> Looking retrospectively, it looks like you can target good customers
> who
> >>> have names or addresses that are not null.  Without a good snapshot of
> >>> each
> >>> customer record at exactly the time that the targeting was done, you
> >>> cannot
> >>> know that *all* customers have a null name and address before you
> target
> >>> them.  This sort of time machine leak can be enormously more subtle
> than
> >>> this.
> >>>
> >>>
> >>>
> >>>> On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com>
> wrote:
> >>>>
> >>>> Gokhan
> >>>>
> >>>>
> >>>> On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com>
> >>>> wrote:
> >>>>
> >>>>> On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> >>>>> vishal.santoshi@gmail.com>
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>> Are we to assume that SGD is still a work in progress and
> >>>>> implementations (
> >>>>>> Cross Fold, Online, Adaptive ) are too flawed to be realistically
> >>> used
> >>>> ?
> >>>>>>
> >>>>>
> >>>>> They are too raw to be accepted uncritically, for sure.  They have
> >>> been
> >>>>> used successfully in production.
> >>>>>
> >>>>>
> >>>>>> The evolutionary algorithm seems to be the core of
> >>>>>> OnlineLogisticRegression,
> >>>>>> which in turn builds up to Adaptive/Cross Fold.
> >>>>>>
> >>>>>>>> b) for truly on-line learning where no repeated passes through the
> >>>>> data..
> >>>>>>
> >>>>>> What would it take to get to an implementation ? How can any one
> >>> help ?
> >>>>>>
> >>>>>
> >>>>> Would you like to help on this?  The amount of work required to get a
> >>>>> distributed asynchronous learner up is moderate, but definitely not
> >>> huge.
> >>>>>
> >>>>
> >>>> Ted, do you describe a generic distributed learner for all kinds of
> >>> online
> >>>> algorithms? Possibly zookeeper-coordinated and with #predict and
> >>>> #getFeedbackAndUpdateTheModel methods?
> >>>>
> >>>>>
> >>>>> I think that OnlineLogisticRegression is basically sound, but should
> >>> get
> >>>> a
> >>>>> better learning rate update equation.  That would largely make the
> >>>>> Adaptive* stuff unnecessary, especially if OLR could be used in the
> >>>>> distributed asynchronous learner.
> >>>>>
> >>>>
> >>>
> >>
> >>
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
I have been swamped.  Generally adagrad is a great idea. The code looks fine at first glance.  Certainly some sort of adagrad would be preferable to the hack that I put in.

Sent from my iPhone

> On Feb 26, 2014, at 18:30, Vishal Santoshi <vi...@gmail.com> wrote:
> 
> Ted,  Any feedback ?
> 
> 
> On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
> <vi...@gmail.com>wrote:
> 
>> Hello Ted,
>> 
>>                  This is regarding the AdaGrad update per feature. I have
>> attached a file which reflects
>> http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf  ( 2 )
>> 
>> 
>> 
>> It does differ from OnlineLogisticRegression in the way it implements
>> 
>> public double perTermLearningRate(int j) ;
>> 
>> 
>> This class maintains 2 Dense Vectors
>> 
>> /**
>> 
>> * ADA  Per Term Sum of Squares of Learning gradients
>> 
>> */
>> 
>> protected Vector perTermLSumOfSquaresOfGradients;
>> 
>> /**
>> 
>> * ADA Per Term Learning gradient
>> 
>> */
>> 
>> protected Vector perTermGradients;
>> 
>> and it overrides the learn(.... ) method to  update these two vectors
>> respectively.
>> 
>> 
>> 
>> 
>> Please tell me if I am totally off here.
>> 
>> 
>> 
>> Thank you for your help and Regards.
>> 
>> 
>> Vishal Santoshi.
>> 
>> 
>> PS: I had wrongly interpreted the code in the last 2 emails. Please ignore.
>> 
>> 
>> 
>> On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning <te...@gmail.com>wrote:
>> 
>>> :-)
>>> 
>>> Many leaks are *very* subtle.
>>> 
>>> One leak that had me going for weeks was in a news wire corpus.  I
>>> couldn't
>>> figure out why the cross validation was so good and running the classifier
>>> on new data was soooo much worse.
>>> 
>>> The answer was that the training corpus had near-duplicate articles.  This
>>> means that there was leakage between the training and test corpora.  This
>>> wasn't quite a target leak, but it was a leak.
>>> 
>>> For target leaks, it is very common to have partial target leaks due to
>>> the
>>> fact that you learn more about positive cases after the moment that you
>>> had
>>> to select which case to investigate.  Suppose, for instance you are
>>> targeting potential customers based on very limited information.  If you
>>> make an enticing offer to the people you target, then those who accept the
>>> offer will buy something from you.  You will also learn some particulars
>>> such as name and address from those who buy from you.
>>> 
>>> Looking retrospectively, it looks like you can target good customers who
>>> have names or addresses that are not null.  Without a good snapshot of
>>> each
>>> customer record at exactly the time that the targeting was done, you
>>> cannot
>>> know that *all* customers have a null name and address before you target
>>> them.  This sort of time machine leak can be enormously more subtle than
>>> this.
>>> 
>>> 
>>> 
>>>> On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com> wrote:
>>>> 
>>>> Gokhan
>>>> 
>>>> 
>>>> On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com>
>>>> wrote:
>>>> 
>>>>> On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
>>>>> vishal.santoshi@gmail.com>
>>>>> 
>>>>>> 
>>>>>> 
>>>>>> Are we to assume that SGD is still a work in progress and
>>>>> implementations (
>>>>>> Cross Fold, Online, Adaptive ) are too flawed to be realistically
>>> used
>>>> ?
>>>>>> 
>>>>> 
>>>>> They are too raw to be accepted uncritically, for sure.  They have
>>> been
>>>>> used successfully in production.
>>>>> 
>>>>> 
>>>>>> The evolutionary algorithm seems to be the core of
>>>>>> OnlineLogisticRegression,
>>>>>> which in turn builds up to Adaptive/Cross Fold.
>>>>>> 
>>>>>>>> b) for truly on-line learning where no repeated passes through the
>>>>> data..
>>>>>> 
>>>>>> What would it take to get to an implementation ? How can any one
>>> help ?
>>>>>> 
>>>>> 
>>>>> Would you like to help on this?  The amount of work required to get a
>>>>> distributed asynchronous learner up is moderate, but definitely not
>>> huge.
>>>>> 
>>>> 
>>>> Ted, do you describe a generic distributed learner for all kinds of
>>> online
>>>> algorithms? Possibly zookeeper-coordinated and with #predict and
>>>> #getFeedbackAndUpdateTheModel methods?
>>>> 
>>>>> 
>>>>> I think that OnlineLogisticRegression is basically sound, but should
>>> get
>>>> a
>>>>> better learning rate update equation.  That would largely make the
>>>>> Adaptive* stuff unnecessary, especially if OLR could be used in the
>>>>> distributed asynchronous learner.
>>>>> 
>>>> 
>>> 
>> 
>> 

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
Ted,  Any feedback ?


On Mon, Feb 24, 2014 at 2:58 PM, Vishal Santoshi
<vi...@gmail.com>wrote:

> Hello Ted,
>
>                   This is regarding the AdaGrad update per feature. I have
> attached a file which reflects
> http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf  ( 2 )
>
>
>
> It does differ from OnlineLogisticRegression in the way it implements
>
> public double perTermLearningRate(int j) ;
>
>
> This class maintains 2 Dense Vectors
>
> /**
>
>  * ADA  Per Term Sum of Squares of Learning gradients
>
>  */
>
> protected Vector perTermLSumOfSquaresOfGradients;
>
> /**
>
>  * ADA Per Term Learning gradient
>
>  */
>
> protected Vector perTermGradients;
>
> and it overrides the learn(.... ) method to  update these two vectors
> respectively.
>
>
>
>
> Please tell me if I am totally off here.
>
>
>
> Thank you for your help and Regards.
>
>
> Vishal Santoshi.
>
>
> PS: I had wrongly interpreted the code in the last 2 emails. Please ignore.
>
>
>
> On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning <te...@gmail.com>wrote:
>
>> :-)
>>
>> Many leaks are *very* subtle.
>>
>> One leak that had me going for weeks was in a news wire corpus.  I
>> couldn't
>> figure out why the cross validation was so good and running the classifier
>> on new data was soooo much worse.
>>
>> The answer was that the training corpus had near-duplicate articles.  This
>> means that there was leakage between the training and test corpora.  This
>> wasn't quite a target leak, but it was a leak.
>>
>> For target leaks, it is very common to have partial target leaks due to
>> the
>> fact that you learn more about positive cases after the moment that you
>> had
>> to select which case to investigate.  Suppose, for instance you are
>> targeting potential customers based on very limited information.  If you
>> make an enticing offer to the people you target, then those who accept the
>> offer will buy something from you.  You will also learn some particulars
>> such as name and address from those who buy from you.
>>
>> Looking retrospectively, it looks like you can target good customers who
>> have names or addresses that are not null.  Without a good snapshot of
>> each
>> customer record at exactly the time that the targeting was done, you
>> cannot
>> know that *all* customers have a null name and address before you target
>> them.  This sort of time machine leak can be enormously more subtle than
>> this.
>>
>>
>>
>> On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com> wrote:
>>
>> > Gokhan
>> >
>> >
>> > On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> >
>> > > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
>> > > vishal.santoshi@gmail.com>
>> > >
>> > > >
>> > > >
>> > > > Are we to assume that SGD is still a work in progress and
>> > > implementations (
>> > > > Cross Fold, Online, Adaptive ) are too flawed to be realistically
>> used
>> > ?
>> > > >
>> > >
>> > > They are too raw to be accepted uncritically, for sure.  They have
>> been
>> > > used successfully in production.
>> > >
>> > >
>> > > > The evolutionary algorithm seems to be the core of
>> > > > OnlineLogisticRegression,
>> > > > which in turn builds up to Adaptive/Cross Fold.
>> > > >
>> > > > >>b) for truly on-line learning where no repeated passes through the
>> > > data..
>> > > >
>> > > > What would it take to get to an implementation ? How can any one
>> help ?
>> > > >
>> > >
>> > > Would you like to help on this?  The amount of work required to get a
>> > > distributed asynchronous learner up is moderate, but definitely not
>> huge.
>> > >
>> >
>> > Ted, do you describe a generic distributed learner for all kinds of
>> online
>> > algorithms? Possibly zookeeper-coordinated and with #predict and
>> > #getFeedbackAndUpdateTheModel methods?
>> >
>> > >
>> > > I think that OnlineLogisticRegression is basically sound, but should
>> get
>> > a
>> > > better learning rate update equation.  That would largely make the
>> > > Adaptive* stuff unnecessary, especially if OLR could be used in the
>> > > distributed asynchronous learner.
>> > >
>> >
>>
>
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
Hello Ted,

                  This is regarding the AdaGrad update per feature. I have
attached a file which reflects
http://www.ark.cs.cmu.edu/cdyer/adagrad.pdf ( 2 )



It does differ from OnlineLogisticRegression in the way it implements

public double perTermLearningRate(int j) ;


This class maintains 2 Dense Vectors

/**

 * ADA  Per Term Sum of Squares of Learning gradients

 */

protected Vector perTermLSumOfSquaresOfGradients;

/**

 * ADA Per Term Learning gradient

 */

protected Vector perTermGradients;

and it overrides the learn(.... ) method to  update these two vectors
respectively.




Please tell me if I am totally off here.



Thank you for your help and Regards.


Vishal Santoshi.


PS: I had wrongly interpreted the code in the last 2 emails. Please ignore.



On Sun, Dec 29, 2013 at 8:45 PM, Ted Dunning <te...@gmail.com> wrote:

> :-)
>
> Many leaks are *very* subtle.
>
> One leak that had me going for weeks was in a news wire corpus.  I couldn't
> figure out why the cross validation was so good and running the classifier
> on new data was soooo much worse.
>
> The answer was that the training corpus had near-duplicate articles.  This
> means that there was leakage between the training and test corpora.  This
> wasn't quite a target leak, but it was a leak.
>
> For target leaks, it is very common to have partial target leaks due to the
> fact that you learn more about positive cases after the moment that you had
> to select which case to investigate.  Suppose, for instance you are
> targeting potential customers based on very limited information.  If you
> make an enticing offer to the people you target, then those who accept the
> offer will buy something from you.  You will also learn some particulars
> such as name and address from those who buy from you.
>
> Looking retrospectively, it looks like you can target good customers who
> have names or addresses that are not null.  Without a good snapshot of each
> customer record at exactly the time that the targeting was done, you cannot
> know that *all* customers have a null name and address before you target
> them.  This sort of time machine leak can be enormously more subtle than
> this.
>
>
>
> On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com> wrote:
>
> > Gokhan
> >
> >
> > On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com>
> > wrote:
> >
> > > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> > > vishal.santoshi@gmail.com>
> > >
> > > >
> > > >
> > > > Are we to assume that SGD is still a work in progress and
> > > implementations (
> > > > Cross Fold, Online, Adaptive ) are too flawed to be realistically
> used
> > ?
> > > >
> > >
> > > They are too raw to be accepted uncritically, for sure.  They have been
> > > used successfully in production.
> > >
> > >
> > > > The evolutionary algorithm seems to be the core of
> > > > OnlineLogisticRegression,
> > > > which in turn builds up to Adaptive/Cross Fold.
> > > >
> > > > >>b) for truly on-line learning where no repeated passes through the
> > > data..
> > > >
> > > > What would it take to get to an implementation ? How can any one
> help ?
> > > >
> > >
> > > Would you like to help on this?  The amount of work required to get a
> > > distributed asynchronous learner up is moderate, but definitely not
> huge.
> > >
> >
> > Ted, do you describe a generic distributed learner for all kinds of
> online
> > algorithms? Possibly zookeeper-coordinated and with #predict and
> > #getFeedbackAndUpdateTheModel methods?
> >
> > >
> > > I think that OnlineLogisticRegression is basically sound, but should
> get
> > a
> > > better learning rate update equation.  That would largely make the
> > > Adaptive* stuff unnecessary, especially if OLR could be used in the
> > > distributed asynchronous learner.
> > >
> >
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
:-)

Many leaks are *very* subtle.

One leak that had me going for weeks was in a news wire corpus.  I couldn't
figure out why the cross validation was so good and running the classifier
on new data was soooo much worse.

The answer was that the training corpus had near-duplicate articles.  This
means that there was leakage between the training and test corpora.  This
wasn't quite a target leak, but it was a leak.

For target leaks, it is very common to have partial target leaks due to the
fact that you learn more about positive cases after the moment that you had
to select which case to investigate.  Suppose, for instance you are
targeting potential customers based on very limited information.  If you
make an enticing offer to the people you target, then those who accept the
offer will buy something from you.  You will also learn some particulars
such as name and address from those who buy from you.

Looking retrospectively, it looks like you can target good customers who
have names or addresses that are not null.  Without a good snapshot of each
customer record at exactly the time that the targeting was done, you cannot
know that *all* customers have a null name and address before you target
them.  This sort of time machine leak can be enormously more subtle than
this.
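
If you do have timestamps, the mechanical defense is to rebuild each
training record "as of" the moment the decision was made and refuse any
fact recorded after it.  A minimal sketch, with invented names:

import java.util.ArrayList;
import java.util.List;

public class AsOfSnapshot {
  static class Fact {
    final String name;
    final double value;
    final long recordedAtMillis;  // when the fact entered the record

    Fact(String name, double value, long recordedAtMillis) {
      this.name = name;
      this.value = value;
      this.recordedAtMillis = recordedAtMillis;
    }
  }

  // Keep only facts that were already on the record at decision time;
  // anything learned later (name, address, ...) is exactly the
  // time-machine leak described above.
  static List<Fact> asOf(List<Fact> record, long decisionTimeMillis) {
    List<Fact> visible = new ArrayList<Fact>();
    for (Fact f : record) {
      if (f.recordedAtMillis <= decisionTimeMillis) {
        visible.add(f);
      }
    }
    return visible;
  }
}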



On Mon, Dec 2, 2013 at 1:50 PM, Gokhan Capan <gk...@gmail.com> wrote:

> Gokhan
>
>
> On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com>
> wrote:
>
> > On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> > vishal.santoshi@gmail.com>
> >
> > >
> > >
> > > Are we to assume that SGD is still a work in progress and
> > > implementations ( Cross Fold, Online, Adaptive ) are too flawed to be
> > > realistically used ?
> > >
> >
> > They are too raw to be accepted uncritically, for sure.  They have been
> > used successfully in production.
> >
> >
> > > The evolutionary algorithm seems to be the core of
> > > OnlineLogisticRegression,
> > > which in turn builds up to Adaptive/Cross Fold.
> > >
> > > >>b) for truly on-line learning where no repeated passes through the data..
> > >
> > > What would it take to get to an implementation ? How can anyone help ?
> > >
> >
> > Would you like to help on this?  The amount of work required to get a
> > distributed asynchronous learner up is moderate, but definitely not huge.
> >
>
> Ted, are you describing a generic distributed learner for all kinds of online
> algorithms? Possibly ZooKeeper-coordinated, with #predict and
> #getFeedbackAndUpdateTheModel methods?
>
> >
> > I think that OnlineLogisticRegression is basically sound, but should get a
> > better learning rate update equation.  That would largely make the
> > Adaptive* stuff unnecessary, especially if OLR could be used in the
> > distributed asynchronous learner.
> >
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Gokhan Capan <gk...@gmail.com>.
Gokhan


On Thu, Nov 28, 2013 at 3:18 AM, Ted Dunning <te...@gmail.com> wrote:

> On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <
> vishal.santoshi@gmail.com>
>
> >
> >
> > Are we to assume that SGD is still a work in progress and implementations (
> > Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
> >
>
> They are too raw to be accepted uncritically, for sure.  They have been
> used successfully in production.
>
>
> > The evolutionary algorithm seems to be the core of
> > OnlineLogisticRegression,
> > which in turn builds up to Adaptive/Cross Fold.
> >
> > >>b) for truly on-line learning where no repeated passes through the data..
> >
> > What would it take to get to an implementation ? How can anyone help ?
> >
>
> Would you like to help on this?  The amount of work required to get a
> distributed asynchronous learner up is moderate, but definitely not huge.
>

Ted, are you describing a generic distributed learner for all kinds of online
algorithms? Possibly ZooKeeper-coordinated, with #predict and
#getFeedbackAndUpdateTheModel methods?
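
To make the question concrete, the shape I imagine is something like the
following.  This is purely hypothetical; neither the interface nor the
method names below exist in Mahout:

import org.apache.mahout.math.Vector;

// Hypothetical contract only; nothing like this exists in Mahout today.
public interface DistributedOnlineLearner {
  // Score an instance against the current, possibly slightly stale, model.
  Vector predict(Vector instance);

  // Fold the observed outcome back into the shared model, e.g. through a
  // ZooKeeper-coordinated owner that merges concurrent updates.
  void getFeedbackAndUpdateTheModel(Vector instance, int actual);
}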

>
> I think that OnlineLogisticRegression is basically sound, but should get a
> better learning rate update equation.  That would largely make the
> Adaptive* stuff unnecessary, especially if OLR could be used in the
> distributed asynchronous learner.
>

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
On Wed, Nov 27, 2013 at 7:07 AM, Vishal Santoshi <vi...@gmail.com>

>
>
> Are we to assume that SGD is still a work in progress and implementations (
> Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
>

They are too raw to be accepted uncritically, for sure.  They have been
used successfully in production.


> The evolutionary algorithm seems to be the core of
> OnlineLogisticRegression,
> which in turn builds up to Adaptive/Cross Fold.
>
> >>b) for truly on-line learning where no repeated passes through the data..
>
> What would it take to get to an implementation ? How can anyone help ?
>

Would you like to help on this?  The amount of work required to get a
distributed asynchronous learner up is moderate, but definitely not huge.

I think that OnlineLogisticRegression is basically sound, but should get a
better learning rate update equation.  That would largely make the
Adaptive* stuff unnecessary, especially if OLR could be used in the
distributed asynchronous learner.
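
For a rough idea of the control flow of such a learner, here is a toy,
single-machine skeleton of asynchronous mini-batch SGD in the Hogwild
style: workers share one weight array and write their updates without
locking.  This is illustration only, not Mahout code; a real distributed
version would put the shared weights behind a parameter server instead of
a local array:

import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class AsyncMiniBatchSgd {
  static final int FEATURES = 1000;
  static final double[] w = new double[FEATURES];  // shared, updated racily

  // One worker: average the logistic-loss gradient over its mini-batch,
  // then apply the step straight to the shared weights, no locks.  With
  // sparse features the collisions are rare enough that convergence
  // usually survives.
  static Runnable worker(final double[][] xs, final int[] ys,
                         final double eta) {
    return new Runnable() {
      public void run() {
        double[] grad = new double[FEATURES];
        for (int i = 0; i < xs.length; i++) {
          double s = 0;
          for (int j = 0; j < FEATURES; j++) {
            s += w[j] * xs[i][j];
          }
          double err = 1.0 / (1.0 + Math.exp(-s)) - ys[i];
          for (int j = 0; j < FEATURES; j++) {
            grad[j] += err * xs[i][j] / xs.length;
          }
        }
        for (int j = 0; j < FEATURES; j++) {
          w[j] -= eta * grad[j];  // unsynchronized on purpose
        }
      }
    };
  }

  static void train(double[][][] batchX, int[][] batchY)
      throws InterruptedException {
    ExecutorService pool = Executors.newFixedThreadPool(4);
    for (int b = 0; b < batchX.length; b++) {
      pool.execute(worker(batchX[b], batchY[b], 0.1));
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }
}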

Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Vishal Santoshi <vi...@gmail.com>.
Hell Ted,

Are we to assume that SGD is still a work in progress and implementations (
Cross Fold, Online, Adaptive ) are too flawed to be realistically used ?
The evolutionary algorithm seems to be the core of OnlineLogisticRegression,
which in turn builds up to Adaptive/Cross Fold.

>>b) for truly on-line learning where no repeated passes through the data..

What would it take to get to an implementation ? How can anyone help ?

Regards,





On Wed, Nov 27, 2013 at 2:26 AM, Ted Dunning <te...@gmail.com> wrote:

> Well, first off, let me say that I am much less of a fan now of the magical
> cross validation approach and adaptation based on that than I was when I
> wrote the ALR code.  There are definitely legs in the ideas, but my
> implementation has a number of flaws.
>
> For example:
>
> a) the way that I provide for handling multiple passes through the data is
> very easy to screw up.  I think that simply separating the data entirely
> might be a better approach.
>
> b) for truly on-line learning where no repeated passes through the data
> will ever occur, then cross validation is not the best choice.  Much better
> in those cases to use what Google researchers described in [1].
>
> c) it is clear from several reports that the evolutionary algorithm
> prematurely shuts down the learning rate.  I think that Adagrad-like
> learning rates are more reliable.  See [1] again for one of the more
> readable descriptions of this.  See also [2] for another view on adaptive
> learning rates.
>
> d) item (c) is also related to the way that learning rates are adapted in
> the underlying OnlineLogisticRegression.  That needs to be fixed.
>
> e) asynchronous parallel stochastic gradient descent with mini-batch
> learning is where we should be headed.  I do not have time to write it,
> however.
>
> All this aside, I am happy to help in any way that I can given my recent
> time limits.
>
>
> [1] http://research.google.com/pubs/pub41159.html
>
> [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf
>
>
>
> On Tue, Nov 26, 2013 at 12:54 PM, optimusfan <op...@yahoo.com> wrote:
>
> > Hi-
> >
> > We're currently working on a binary classifier using
> > Mahout's AdaptiveLogisticRegression class.  We're trying to determine
> > whether or not the models are suffering from high bias or variance and were
> > wondering how to do this using Mahout's APIs?  I can easily calculate the
> > cross validation error and I think I could detect high bias or variance if
> > I could compare that number to my training error, but I'm not sure how to
> > do this.  Or, any other ideas would be appreciated!
> >
> > Thanks,
> > Ian
>

Moving valuable docs to Mahout site [Was: Re: Detecting high bias and variance in AdaptiveLogisticRegression classification]

Posted by Isabel Drost-Fromm <is...@apache.org>.
Hi,

when going through our wiki docs to convert them to Apache CMS I got
the impression that we could improve a lot in particular when it comes
to explaining strengths and limitations of particular approaches.

While reading the below (and several other valuable mails
on user@) I started to wonder whether texts like this could be a basis
for more detailed docs. 

The future work part could go on a separate page that tracks potential
future work. One could think about using JIRA as well, but then again
larger items like this do not look like they will be done within the
next few months unless someone with some particular interest steps up
to work on these...

What do you think?


Isabel


On Tue, 26 Nov 2013 23:26:11 -0800
Ted Dunning <te...@gmail.com> wrote:

> Well, first off, let me say that I am much less of a fan now of the
> magical cross validation approach and adaptation based on that than I
> was when I wrote the ALR code.  There are definitely legs in the
> ideas, but my implementation has a number of flaws.
> 
> For example:
> 
> a) the way that I provide for handling multiple passes through the
> data is very easy to screw up.  I think that simply separating the
> data entirely might be a better approach.
> 
> b) for truly on-line learning where no repeated passes through the
> data will ever occur, then cross validation is not the best choice.
> Much better in those cases to use what Google researchers described
> in [1].
> 
> c) it is clear from several reports that the evolutionary algorithm
> prematurely shuts down the learning rate.  I think that Adagrad-like
> learning rates are more reliable.  See [1] again for one of the more
> readable descriptions of this.  See also [2] for another view on
> adaptive learning rates.
> 
> d) item (c) is also related to the way that learning rates are
> adapted in the underlying OnlineLogisticRegression.  That needs to be
> fixed.
> 
> e) asynchronous parallel stochastic gradient descent with mini-batch
> learning is where we should be headed.  I do not have time to write
> it, however.
> 
> All this aside, I am happy to help in any way that I can given my
> recent time limits.
> 
> 
> [1] http://research.google.com/pubs/pub41159.html
> 
> [2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf


Re: Detecting high bias and variance in AdaptiveLogisticRegression classification

Posted by Ted Dunning <te...@gmail.com>.
Well, first off, let me say that I am much less of a fan now of the magical
cross validation approach and adaptation based on that than I was when I
wrote the ALR code.  There are definitely legs in the ideas, but my
implementation has a number of flaws.

For example:

a) the way that I provide for handling multiple passes through the data is
very easy to screw up.  I think that simply separating the data entirely
might be a better approach.

b) for truly on-line learning where no repeated passes through the data
will ever occur, then cross validation is not the best choice.  Much better
in those cases to use what Google researchers described in [1].

c) it is clear from several reports that the evolutionary algorithm
prematurely shuts down the learning rate.  I think that Adagrad-like
learning rates are more reliable.  See [1] again for one of the more
readable descriptions of this.  See also [2] for another view on adaptive
learning rates.

d) item (c) is also related to the way that learning rates are adapted in
the underlying OnlineLogisticRegression.  That needs to be fixed.

e) asynchronous parallel stochastic gradient descent with mini-batch
learning is where we should be headed.  I do not have time to write it,
however.

All this aside, I am happy to help in any way that I can given my recent
time limits.


[1] http://research.google.com/pubs/pub41159.html

[2] http://www.cs.jhu.edu/~mdredze/publications/cw_nips_08.pdf
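
To make (c) concrete: the AdaGrad-style rate amounts to dividing a base
rate by the square root of the accumulated squared gradient for each
feature, so frequently-updated features cool down while rare ones stay
plastic.  A self-contained toy version for binary logistic regression --
the idea only, not the OLR code:

public class AdagradLogistic {
  private final double[] w;          // weights
  private final double[] sumSqGrad;  // per-feature sum of squared gradients
  private final double eta;          // base learning rate

  AdagradLogistic(int numFeatures, double eta) {
    this.w = new double[numFeatures];
    this.sumSqGrad = new double[numFeatures];
    this.eta = eta;
  }

  double predict(double[] x) {
    double s = 0;
    for (int i = 0; i < w.length; i++) {
      s += w[i] * x[i];
    }
    return 1.0 / (1.0 + Math.exp(-s));
  }

  // One SGD step.  Each feature gets its own rate eta / sqrt(sum g^2),
  // so the rate never collapses globally the way a single annealed
  // schedule can.
  void train(double[] x, int y) {
    double error = predict(x) - y;  // d(log loss)/d(score)
    for (int i = 0; i < w.length; i++) {
      double g = error * x[i];
      if (g == 0) {
        continue;
      }
      sumSqGrad[i] += g * g;
      w[i] -= eta * g / Math.sqrt(sumSqGrad[i]);
    }
  }
}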



On Tue, Nov 26, 2013 at 12:54 PM, optimusfan <op...@yahoo.com> wrote:

> Hi-
>
> We're currently working on a binary classifier using
> Mahout's AdaptiveLogisticRegression class.  We're trying to determine
> whether or not the models are suffering from high bias or variance and were
> wondering how to do this using Mahout's APIs?  I can easily calculate the
> cross validation error and I think I could detect high bias or variance if
> I could compare that number to my training error, but I'm not sure how to
> do this.  Or, any other ideas would be appreciated!
>
> Thanks,
> Ian
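
Coming back to the original question: the simplest check is to train once,
then score the training set and a held-out set with the same model and
compare.  A sketch against the SGD classes; the method names used here
(learningRate, lambda, train, classifyScalar, Auc) are from memory of the
current API, so verify them against your Mahout version:

import java.util.List;

import org.apache.mahout.classifier.evaluation.Auc;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.Vector;

public class BiasVarianceCheck {
  // Train on the training set, then report AUC on both sets.  A large
  // train-vs-held-out gap points at variance (or a leak); both numbers
  // being low points at bias.
  static void report(List<Vector> trainX, List<Integer> trainY,
                     List<Vector> heldX, List<Integer> heldY,
                     int numFeatures) {
    OnlineLogisticRegression olr =
        new OnlineLogisticRegression(2, numFeatures, new L1());
    olr.learningRate(1);
    olr.lambda(1.0e-4);

    for (int i = 0; i < trainX.size(); i++) {
      olr.train(trainY.get(i), trainX.get(i));
    }

    Auc trainAuc = new Auc();
    for (int i = 0; i < trainX.size(); i++) {
      trainAuc.add(trainY.get(i), olr.classifyScalar(trainX.get(i)));
    }
    Auc heldAuc = new Auc();
    for (int i = 0; i < heldX.size(); i++) {
      heldAuc.add(heldY.get(i), olr.classifyScalar(heldX.get(i)));
    }

    System.out.printf("train AUC %.3f, held-out AUC %.3f%n",
        trainAuc.auc(), heldAuc.auc());
  }
}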