Posted to user@mahout.apache.org by Benson Margulies <bi...@gmail.com> on 2011/04/28 21:35:22 UTC

Logistic Regression Tutorial

Is there a logistic regression tutorial in the house? I've got a stack
of files (Arabic ones, no less) and I want to train and score a
classifier.

Re: Logistic Regression Tutorial

Posted by Mike Nute <mi...@gmail.com>.
Ya that is a little odd...

On Fri, Apr 29, 2011 at 7:36 AM, Benson Margulies <bi...@gmail.com>wrote:

> With some help from Ted (which I plan to turn into a checked-in tool
> if he doesn't get there first), I'm running LR on my initial small
> example.
>
> I adapted Ted's rcv1 sample to digest a directory containing
> subdirectories containing exemplars.
>
> Ted's delightfully small program pushes all of the data into the model
> 'n' times (n is 10 in my current variation). It displays the best
> learner's accuracy at each iteration.
>
> The example is 1000 docs in 10 categories.
>
> With 20k features, I note that the accuracy scores get worse on each
> iteration of pushing the data into the model.
>
> After the first pass, the model hasn't trained yet. After the second,
> accuracy is 95.6%, and then it drifts gracefully downward with each
> additional iteration, landing at .83.
>
> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
> but this pattern is not intuitive to me.
>
> On Thu, Apr 28, 2011 at 5:59 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > And, of course, the current SGD learner handles the multinomial case.
> >
> > On Thu, Apr 28, 2011 at 2:52 PM, Mike Nute <mi...@gmail.com> wrote:
> >
> >> Once you do the vectorization, that becomes the feature vector for your
> >> GLM.  The problem with doing multinomial logit is that if you have a
> >> feature
> >> vector of size K and N different categories, you end up with K*(N-1)
> >> separate parameters to fit which can be nasty, though there are ways to
> get
> >> around that by constraining them.  The N-way case is equivalent to doing
> >> (N-1) separate binomial logits.
> >>
> >> Does that help with the connection between the vectorization process and
> >> LR?
> >>
> >> MN
> >>
> >> On Thu, Apr 28, 2011 at 5:07 PM, Benson Margulies <
> bimargulies@gmail.com
> >> >wrote:
> >>
> >> > Thanks, all. I get frustrated really fast when trying to read a PDF.
> >> > I guess I'm a fossil.
> >> >
> >> > On Thu, Apr 28, 2011 at 4:54 PM, Ted Dunning <te...@gmail.com>
> >> > wrote:
> >> > > The TrainNewsGroups class does this not quite as nicely as is
> possible
> >> > (it
> >> > > avoids the TextValueEncoder).
> >> > >
> >> > > I will post a simplified example on github that I just worked up for
> >> > RCV1.
> >> > >
> >> > >
> >> > >
> >> > > On Thu, Apr 28, 2011 at 1:32 PM, Chris Schilling <
> chris@cellixis.com>
> >> > wrote:
> >> > >
> >> > >> Benson,
> >> > >>
> >> > >> Chapter 14 and 15 discuss the 20 newsgroups classification example
> >> using
> >> > >> bag-of-words.  In this implementation of LR, you have to manually
> >> create
> >> > the
> >> > >> feature vectors when iterating through the files.  The features are
> >> > hashed
> >> > >> into a vector of predetermined length.  The examples are very clear
> >> and
> >> > easy
> >> > >> to setup.  I can send you some code I wrote for a similar problem
> if
> >> it
> >> > will
> >> > >> help.
> >> > >>
> >> > >> Chris
> >> > >>
> >> > >> On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:
> >> > >>
> >> > >> > Chris,
> >> > >> >
> >> > >> > I'm looking at a recently-purchased MIA.
> >> > >> >
> >> > >> > The LR example is all about the donut file, which has features
> that
> >> > >> > don't look anything like, even remotely, a full-up bag-of-words
> >> > >> > vector.
> >> > >> >
> >> > >> > I'm lacking the point of connection between the vectorization
> >> process
> >> > >> > (which we have some experience here with running canopy/kmeans)
> and
> >> > >> > the LR example. It's probably some simple principle that I'm
> failing
> >> > >> > to grasp.
> >> > >> >
> >> > >> > --benson
> >> > >> >
> >> > >> >
> >> > >> > On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <
> >> chris@cellixis.com>
> >> > >> wrote:
> >> > >> >> Benson,
> >> > >> >>
> >> > >> >> The latest chapters in Mahout in Action cover document
> >> classification
> >> > >> using LR very well.
> >> > >> >>
> >> > >> >> Chris
> >> > >> >>
> >> > >> >>
> >> > >> >> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
> >> > >> >>
> >> > >> >>> Mike,
> >> > >> >>>
> >> > >> >>> in the time available for the experiment I want to perform, all
> I
> >> > can
> >> > >> >>> imagine doing is turning each document into a bag-of-words
> feature
> >> > >> >>> vector. So, I want to run the pipeline of lucene->vectors->...
> and
> >> > >> >>> train a model. I confess that I don't have the time to try to
> >> absorb
> >> > >> >>> the underlying math, indeed, I have some co-workers who can
> help
> >> me
> >> > >> >>> with that. My problem is entirely plumbing at this point.
> >> > >> >>>
> >> > >> >>> --benson
> >> > >> >>>
> >> > >> >>>
> >> > >> >>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <
> mike.nute@gmail.com>
> >> > >> wrote:
> >> > >> >>>> Benson,
> >> > >> >>>>
> >> > >> >>>> Lecture 3 in this one is a good intro to the logit model:
> >> > >> >>>>
> >> > >> >>>>
> >> > >>
> >> >
> >>
> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
> >> > >> >>>>
> >> > >> >>>> The lecture notes are pretty solid too so that might be
> faster.
> >> > >> >>>>
> >> > >> >>>> The short version: Logistic Regression is a GLM with the link
> >> > f^-1(x)
> >> > >> =
> >> > >> >>>> 1/(1+e^(xB)) and a Binomial likelihood function.  You can
> >> > >> alternatively use
> >> > >> >>>> Batch or Stochastic Gradient Descent.
> >> > >> >>>>
> >> > >> >>>> I've never done document classification before though, so I'm
> not
> >> > much
> >> > >> help
> >> > >> >>>> with more complicated things like choosing the feature vector.
> >> > >> >>>>
> >> > >> >>>> Good Luck,
> >> > >> >>>> Mike Nute
> >> > >> >>>>
> >> > >> >>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <
> >> > >> bimargulies@gmail.com>wrote:
> >> > >> >>>>
> >> > >> >>>>> Is there a logistic regression tutorial in the house? I've
> got a
> >> > >> stack
> >> > >> >>>>> of files (Arabic ones, no less) and I want to train and score
> a
> >> > >> >>>>> classifier.
> >> > >> >>>>>
> >> > >> >>>>
> >> > >> >>>>
> >> > >> >>>>
> >> > >> >>>> --
> >> > >> >>>> Michael Nute
> >> > >> >>>> Mike.Nute@gmail.com
> >> > >> >>>>
> >> > >> >>
> >> > >> >>
> >> > >>
> >> > >>
> >> > >
> >> >
> >>
> >>
> >>
> >> --
> >> Michael Nute
> >> Mike.Nute@gmail.com
> >>
> >
>



-- 
Michael Nute
Mike.Nute@gmail.com

Re: Logistic Regression Tutorial

Posted by Ted Dunning <te...@gmail.com>.
Yes.  It does stand for area under the (ROC) curve.

But the better definition is that it is the probability that a randomly
selected positive example will have a higher score than a randomly selected
negative example.  This is not specific to likelihood functions because it
is invariant under any monotonic transformation of the score.  This makes it
very useful for comparing different modeling technologies and is very good
in ranking applications.

AUC is specific to binary cases (there are various extensions to multinomial
cases, none of them very satisfactory and all expensive to compute).
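
A sketch of what that pairwise definition looks like in code (not Mahout's
implementation, just the probability interpretation spelled out; ties count as
half a win):

    // AUC as P(score of a random positive > score of a random negative).
    // Quadratic in the number of examples, fine for a small held-out set.
    static double auc(double[] positiveScores, double[] negativeScores) {
      double wins = 0;
      for (double p : positiveScores) {
        for (double n : negativeScores) {
          if (p > n) {
            wins += 1.0;
          } else if (p == n) {
            wins += 0.5;
          }
        }
      }
      return wins / (positiveScores.length * (double) negativeScores.length);
    }

Because only the ordering of scores matters here, any monotonic rescaling of
the model's output leaves the number unchanged, which is the invariance
mentioned above.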



On Fri, Apr 29, 2011 at 9:44 AM, Mike Nute <mi...@gmail.com> wrote:

> Ted,
>
> Dumb question: what is AUC? That stands for area-under-curve right? I've
> always thought of logistic regression in terms of the likelihood function;
> is that a variation on likelihood or something totally different?
>
> Thanks,
> Mike Nute
> -----Original Message-----
> From: Ted Dunning <te...@gmail.com>
> Date: Fri, 29 Apr 2011 09:39:05
> To: <us...@mahout.apache.org>
> Reply-To: user@mahout.apache.org
> Subject: Re: Logistic Regression Tutorial
>
> Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of me,
> but I haven't had time to drill in on it.
>
> With RCV1, however, the AUC stayed constant and high.  AUC is what the
> evolutionary algorithm is fighting for while percent correct is only for a
> single threshold (0.5 for the binary case).  With asymmetric class rates,
> that threshold might be sub-optimal.  AUC doesn't use a threshold so that
> won't be an issue with it.  It is pretty easy to make the evo algorithm use
> percent-correct instead of AUC.
>
> Regarding the over-fitting, these accuracies are on-line estimates being
> reported on held-out data so it should be a reasonable estimate of error.
>  With a time-based train/test split, test performance will probably be a
> bit
> lower than the estimate.
>
> The held-out data is formed by doing cross validation on the fly.  Each
> CrossFoldLearner inside the evolutionary algorithm maintains 5 online
> learning algorithms each of which gets a different split of training and
> test data.  This means that we get an out-of-sample estimate of performance
> every time we add a training sample.
>
> On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <bimargulies@gmail.com
> >wrote:
>
> > After the first pass, the model hasn't trained yet. After the second,
> > accuracy is 95.6%, and then it drifts gracefully downward with each
> > additional iteration, landing at .83.
> >
> > I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
> > but this pattern is not intuitive to me.
> >
>
>

Re: Logistic Regression Tutorial

Posted by Mike Nute <mi...@gmail.com>.
Ted,

Dumb question: what is AUC? That stands for area-under-curve right? I've always thought of logistic regression in terms of the likelihood function; is that a variation on likelihood or something totally different?

Thanks,
Mike Nute
-----Original Message-----
From: Ted Dunning <te...@gmail.com>
Date: Fri, 29 Apr 2011 09:39:05 
To: <us...@mahout.apache.org>
Reply-To: user@mahout.apache.org
Subject: Re: Logistic Regression Tutorial

Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of me,
but I haven't had time to drill in on it.

With RCV1, however, the AUC stayed constant and high.  AUC is what the
evolutionary algorithm is fighting for while percent correct is only for a
single threshold (0.5 for the binary case).  With asymmetric class rates,
that threshold might be sub-optimal.  AUC doesn't use a threshold so that
won't be an issue with it.  It is pretty easy to make the evo algorithm use
percent-correct instead of AUC.

Regarding the over-fitting, these accuracies are on-line estimates being
reported on held-out data so it should be a reasonable estimate of error.
 With a time-based train/test split, test performance will probably be a bit
lower than the estimate.

The held-out data is formed by doing cross validation on the fly.  Each
CrossFoldLearner inside the evolutionary algorithm maintains 5 online
learning algorithms each of which gets a different split of training and
test data.  This means that we get an out-of-sample estimate of performance
every time we add a training sample.

On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <bi...@gmail.com>wrote:

> After the first pass, the model hasn't trained yet. After the second,
> accuracy is 95.6%, and then it drifts gracefully downward with each
> additional iteration, landing at .83.
>
> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
> but this pattern is not intuitive to me.
>


Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
Still nothing but 0.5. Oh well, we wait for more data.

On Fri, Apr 29, 2011 at 6:17 PM, Ted Dunning <te...@gmail.com> wrote:
> Try these instead:
>
>    learningAlgorithm.setInterval(800);
>    learningAlgorithm.setAveragingWindow(500);
>
> Pool size should not be decreased without serious reason.  It is already
> quite small by default so that large problems run reasonably fast.
>
> The averaging window needs to be long enough to get stable estimates.
>  Interval needs to be kind of long as well.
>
> On Fri, Apr 29, 2011 at 2:20 PM, Benson Margulies <bi...@gmail.com>wrote:
>
>> The following had no effect.
>>
>>        AdaptiveLogisticRegression model = new
>> AdaptiveLogisticRegression(topicNumbers.size(), FEATURES, new L1());
>>        model.setInterval(200, 200);
>>        model.setAveragingWindow(10);
>>        model.setPoolSize(10);
>>
>> On Fri, Apr 29, 2011 at 1:58 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > Hmm... this looks very wrong.  AUC is 0.5 here which indicates that it
>> has
>> > no data.
>> >
>> > There are a few options on AdaptiveLogisticRegression to set the
>> averaging
>> > window and
>> > multi-threading batch size.  These probably should be set very small for
>> > your example (which
>> > is far smaller than I envisioned for this code).
>> >
>> > Alternately, I can set up a non-adaptive trainer that does the EP outside
>> of
>> > the learning.  This
>> > is much slower and much less scalable, but that hardly matters for a
>> > toy-sized problem.
>> >
>> > Let me know if you need that.
>> >
>> > On Fri, Apr 29, 2011 at 10:40 AM, Benson Margulies <
>> bimargulies@gmail.com>wrote:
>> >
>> >> If I read this right, the AUC is constant:
>> >>
>> >>         1       1000 0.50 95.6
>> >>         2       1000 0.50 93.9
>> >>         3       1000 0.50 92.4
>> >>         4       1000 0.50 91.1
>> >>         5       1000 0.50 87.3
>> >>         6       1000 0.50 86.4
>> >>         7       1000 0.50 85.5
>> >>         8       1000 0.50 84.6
>> >>         9       1000 0.50 83.8
>> >>                          0.50 83.1 (final)
>> >>
>> >> Where do I go from here? Just run one iteration? Wait for more data?
>> >>
>> >> On Fri, Apr 29, 2011 at 12:39 PM, Ted Dunning <te...@gmail.com>
>> >> wrote:
>> >> > Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of
>> me,
>> >> > but I haven't had time to drill in on it.
>> >> >
>> >> > With RCV1, however, the AUC stayed constant and high.  AUC is what the
>> >> > evolutionary algorithm is fighting for while percent correct is only
>> for
>> >> a
>> >> > single threshold (0.5 for the binary case).  With asymmetric class
>> rates,
>> >> > that threshold might be sub-optimal.  AUC doesn't use a threshold so
>> that
>> >> > won't be an issue with it.  It is pretty easy to make the evo
>> algorithm
>> >> use
>> >> > percent-correct instead of AUC.
>> >> >
>> >> > Regarding the over-fitting, these accuracies are on-line estimates
>> being
>> >> > reported on held-out data so it should be a reasonable estimate of
>> error.
>> >> >  With a time-based train/test split, test performance will probably be
>> a
>> >> bit
>> >> > lower than the estimate.
>> >> >
>> >> > The held-out data is formed by doing cross validation on the fly.
>>  Each
>> >> > CrossFoldLearner inside the evolutionary algorithm maintains 5 online
>> >> > learning algorithms each of which gets a different split of training
>> and
>> >> > test data.  This means that we get an out-of-sample estimate of
>> >> performance
>> >> > every time we add a training sample.
>> >> >
>> >> > On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <
>> bimargulies@gmail.com
>> >> >wrote:
>> >> >
>> >> >> After the first pass, the model hasn't trained yet. After the second,
>> >> >> accuracy is 95.6%, and then it drifts gracefully downward with each
>> >> >> additional iteration, landing at .83.
>> >> >>
>> >> >> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
>> >> >> but this pattern is not intuitive to me.
>> >> >>
>> >> >
>> >>
>> >
>>
>

Re: Logistic Regression Tutorial

Posted by Ted Dunning <te...@gmail.com>.
Try these instead:

    learningAlgorithm.setInterval(800);
    learningAlgorithm.setAveragingWindow(500);

Pool size should not be decreased without serious reason.  It is already
quite small by default so that large problems run reasonably fast.

The averaging window needs to be long enough to get stable estimates.  The
interval needs to be fairly long as well.
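
Putting those numbers together with the constructor from Benson's snippet, the
setup would look roughly like this (topicNumbers and FEATURES are Benson's own
names; the comments reflect a reading of the thread rather than the javadoc):

    import org.apache.mahout.classifier.sgd.AdaptiveLogisticRegression;
    import org.apache.mahout.classifier.sgd.L1;

    AdaptiveLogisticRegression learningAlgorithm =
        new AdaptiveLogisticRegression(topicNumbers.size(), FEATURES, new L1());
    learningAlgorithm.setInterval(800);        // interval between evolutionary steps, in examples
    learningAlgorithm.setAveragingWindow(500); // window for smoothing the on-line metrics
    // pool size left at its default, per the advice above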

On Fri, Apr 29, 2011 at 2:20 PM, Benson Margulies <bi...@gmail.com>wrote:

> The following had no effect.
>
>        AdaptiveLogisticRegression model = new
> AdaptiveLogisticRegression(topicNumbers.size(), FEATURES, new L1());
>        model.setInterval(200, 200);
>        model.setAveragingWindow(10);
>        model.setPoolSize(10);
>
> On Fri, Apr 29, 2011 at 1:58 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Hmm... this looks very wrong.  AUC is 0.5 here which indicates that it
> has
> > no data.
> >
> > There are a few options on AdaptiveLogisticRegression to set the
> averaging
> > window and
> > multi-threading batch size.  These probably should be set very small for
> > your example (which
> > is far smaller than I envisioned for this code).
> >
> > Alternately, I can set up a non-adaptive trainer that does the EP outside
> of
> > the learning.  This
> > is much slower and much less scalable, but that hardly matters for a
> > toy-sized problem.
> >
> > Let me know if you need that.
> >
> > On Fri, Apr 29, 2011 at 10:40 AM, Benson Margulies <
> bimargulies@gmail.com>wrote:
> >
> >> If I read this right, the AUC is constant:
> >>
> >>         1       1000 0.50 95.6
> >>         2       1000 0.50 93.9
> >>         3       1000 0.50 92.4
> >>         4       1000 0.50 91.1
> >>         5       1000 0.50 87.3
> >>         6       1000 0.50 86.4
> >>         7       1000 0.50 85.5
> >>         8       1000 0.50 84.6
> >>         9       1000 0.50 83.8
> >>                          0.50 83.1 (final)
> >>
> >> Where do I go from here? Just run one iteration? Wait for more data?
> >>
> >> On Fri, Apr 29, 2011 at 12:39 PM, Ted Dunning <te...@gmail.com>
> >> wrote:
> >> > Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of
> me,
> >> > but I haven't had time to drill in on it.
> >> >
> >> > With RCV1, however, the AUC stayed constant and high.  AUC is what the
> >> > evolutionary algorithm is fighting for while percent correct is only
> for
> >> a
> >> > single threshold (0.5 for the binary case).  With asymmetric class
> rates,
> >> > that threshold might be sub-optimal.  AUC doesn't use a threshold so
> that
> >> > won't be an issue with it.  It is pretty easy to make the evo
> algorithm
> >> use
> >> > percent-correct instead of AUC.
> >> >
> >> > Regarding the over-fitting, these accuracies are on-line estimates
> being
> >> > reported on held-out data so it should be a reasonable estimate of
> error.
> >> >  With a time-based train/test split, test performance will probably be
> a
> >> bit
> >> > lower than the estimate.
> >> >
> >> > The held-out data is formed by doing cross validation on the fly.
>  Each
> >> > CrossFoldLearner inside the evolutionary algorithm maintains 5 online
> >> > learning algorithms each of which gets a different split of training
> and
> >> > test data.  This means that we get an out-of-sample estimate of
> >> performance
> >> > every time we add a training sample.
> >> >
> >> > On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <
> bimargulies@gmail.com
> >> >wrote:
> >> >
> >> >> After the first pass, the model hasn't trained yet. After the second,
> >> >> accuracy is 95.6%, and then it drifts gracefully downward with each
> >> >> additional iteration, landing at .83.
> >> >>
> >> >> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
> >> >> but this pattern is not intuitive to me.
> >> >>
> >> >
> >>
> >
>

Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
The following had no effect.

        AdaptiveLogisticRegression model =
            new AdaptiveLogisticRegression(topicNumbers.size(), FEATURES, new L1());
        model.setInterval(200, 200);
        model.setAveragingWindow(10);
        model.setPoolSize(10);

On Fri, Apr 29, 2011 at 1:58 PM, Ted Dunning <te...@gmail.com> wrote:
> Hmm... this looks very wrong.  AUC is 0.5 here which indicates that it has
> no data.
>
> There are a few options on AdaptiveLogisticRegression to set the averaging
> window and
> multi-threading batch size.  These probably should be set very small for
> your example (which
> is far smaller than I envisioned for this code).
>
> Alternately, I can set up a non-adaptive trainer that does the EP outside of
> the learning.  This
> is much slower and much less scalable, but that hardly matters for a
> toy-sized problem.
>
> Let me know if you need that.
>
> On Fri, Apr 29, 2011 at 10:40 AM, Benson Margulies <bi...@gmail.com>wrote:
>
>> If I read this right, the AUC is constant:
>>
>>         1       1000 0.50 95.6
>>         2       1000 0.50 93.9
>>         3       1000 0.50 92.4
>>         4       1000 0.50 91.1
>>         5       1000 0.50 87.3
>>         6       1000 0.50 86.4
>>         7       1000 0.50 85.5
>>         8       1000 0.50 84.6
>>         9       1000 0.50 83.8
>>                          0.50 83.1 (final)
>>
>> Where do I go from here? Just run one iteration? Wait for more data?
>>
>> On Fri, Apr 29, 2011 at 12:39 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of me,
>> > but I haven't had time to drill in on it.
>> >
>> > With RCV1, however, the AUC stayed constant and high.  AUC is what the
>> > evolutionary algorithm is fighting for while percent correct is only for
>> a
>> > single threshold (0.5 for the binary case).  With asymmetric class rates,
>> > that threshold might be sub-optimal.  AUC doesn't use a threshold so that
>> > won't be an issue with it.  It is pretty easy to make the evo algorithm
>> use
>> > percent-correct instead of AUC.
>> >
>> > Regarding the over-fitting, these accuracies are on-line estimates being
>> > reported on held-out data so it should be a reasonable estimate of error.
>> >  With a time-based train/test split, test performance will probably be a
>> bit
>> > lower than the estimate.
>> >
>> > The held-out data is formed by doing cross validation on the fly.  Each
>> > CrossFoldLearner inside the evolutionary algorithm maintains 5 online
>> > learning algorithms each of which gets a different split of training and
>> > test data.  This means that we get an out-of-sample estimate of
>> performance
>> > every time we add a training sample.
>> >
>> > On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <bimargulies@gmail.com
>> >wrote:
>> >
>> >> After the first pass, the model hasn't trained yet. After the second,
>> >> accuracy is 95.6%, and then it drifts gracefully downward with each
>> >> additional iteration, landing at .83.
>> >>
>> >> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
>> >> but this pattern is not intuitive to me.
>> >>
>> >
>>
>

Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
Let me try the params, and see if the customer delivers a much larger
wad of data soon before I ask you to do more work.

On Fri, Apr 29, 2011 at 1:58 PM, Ted Dunning <te...@gmail.com> wrote:
> Hmm... this looks very wrong.  AUC is 0.5 here which indicates that it has
> no data.
>
> There are a few options on AdaptiveLogisticRegression to set the averaging
> window and
> multi-threading batch size.  These probably should be set very small for
> your example (which
> is far smaller than I envisioned for this code).
>
> Alternately, I can set up a non-adaptive trainer that does the EP outside of
> the learning.  This
> is much slower and much less scalable, but that hardly matters for a
> toy-sized problem.
>
> Let me know if you need that.
>
> On Fri, Apr 29, 2011 at 10:40 AM, Benson Margulies <bi...@gmail.com>wrote:
>
>> If I read this right, the AUC is constant:
>>
>>         1       1000 0.50 95.6
>>         2       1000 0.50 93.9
>>         3       1000 0.50 92.4
>>         4       1000 0.50 91.1
>>         5       1000 0.50 87.3
>>         6       1000 0.50 86.4
>>         7       1000 0.50 85.5
>>         8       1000 0.50 84.6
>>         9       1000 0.50 83.8
>>                          0.50 83.1 (final)
>>
>> Where do I go from here? Just run one iteration? Wait for more data?
>>
>> On Fri, Apr 29, 2011 at 12:39 PM, Ted Dunning <te...@gmail.com>
>> wrote:
>> > Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of me,
>> > but I haven't had time to drill in on it.
>> >
>> > With RCV1, however, the AUC stayed constant and high.  AUC is what the
>> > evolutionary algorithm is fighting for while percent correct is only for
>> a
>> > single threshold (0.5 for the binary case).  With asymmetric class rates,
>> > that threshold might be sub-optimal.  AUC doesn't use a threshold so that
>> > won't be an issue with it.  It is pretty easy to make the evo algorithm
>> use
>> > percent-correct instead of AUC.
>> >
>> > Regarding the over-fitting, these accuracies are on-line estimates being
>> > reported on held-out data so it should be a reasonable estimate of error.
>> >  With a time-based train/test split, test performance will probably be a
>> bit
>> > lower than the estimate.
>> >
>> > The held-out data is formed by doing cross validation on the fly.  Each
>> > CrossFoldLearner inside the evolutionary algorithm maintains 5 online
>> > learning algorithms each of which gets a different split of training and
>> > test data.  This means that we get an out-of-sample estimate of
>> performance
>> > every time we add a training sample.
>> >
>> > On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <bimargulies@gmail.com
>> >wrote:
>> >
>> >> After the first pass, the model hasn't trained yet. After the second,
>> >> accuracy is 95.6%, and then it drifts gracefully downward with each
>> >> additional iteration, landing at .83.
>> >>
>> >> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
>> >> but this pattern is not intuitive to me.
>> >>
>> >
>>
>

Re: Logistic Regression Tutorial

Posted by Ted Dunning <te...@gmail.com>.
Hmm... this looks very wrong.  AUC is 0.5 here which indicates that it has
no data.

There are a few options on AdaptiveLogisticRegression to set the averaging
window and
multi-threading batch size.  These probably should be set very small for
your example (which
is far smaller than I envisioned for this code).

Alternately, I can set up a non-adaptive trainer that does the EP outside of
the learning.  This
is much slower and much less scalable, but that hardly matters for a
toy-sized problem.

Let me know if you need that.

On Fri, Apr 29, 2011 at 10:40 AM, Benson Margulies <bi...@gmail.com>wrote:

> If I read this right, the AUC is constant:
>
>         1       1000 0.50 95.6
>         2       1000 0.50 93.9
>         3       1000 0.50 92.4
>         4       1000 0.50 91.1
>         5       1000 0.50 87.3
>         6       1000 0.50 86.4
>         7       1000 0.50 85.5
>         8       1000 0.50 84.6
>         9       1000 0.50 83.8
>                          0.50 83.1 (final)
>
> Where do I go from here? Just run one iteration? Wait for more data?
>
> On Fri, Apr 29, 2011 at 12:39 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of me,
> > but I haven't had time to drill in on it.
> >
> > With RCV1, however, the AUC stayed constant and high.  AUC is what the
> > evolutionary algorithm is fighting for while percent correct is only for
> a
> > single threshold (0.5 for the binary case).  With asymmetric class rates,
> > that threshold might be sub-optimal.  AUC doesn't use a threshold so that
> > won't be an issue with it.  It is pretty easy to make the evo algorithm
> use
> > percent-correct instead of AUC.
> >
> > Regarding the over-fitting, these accuracies are on-line estimates being
> > reported on held-out data so it should be a reasonable estimate of error.
> >  With a time-based train/test split, test performance will probably be a
> bit
> > lower than the estimate.
> >
> > The held-out data is formed by doing cross validation on the fly.  Each
> > CrossFoldLearner inside the evolutionary algorithm maintains 5 online
> > learning algorithms each of which gets a different split of training and
> > test data.  This means that we get an out-of-sample estimate of
> performance
> > every time we add a training sample.
> >
> > On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <bimargulies@gmail.com
> >wrote:
> >
> >> After the first pass, the model hasn't trained yet. After the second,
> >> accuracy is 95.6%, and then it drifts gracefully downward with each
> >> additional iteration, landing at .83.
> >>
> >> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
> >> but this pattern is not intuitive to me.
> >>
> >
>

Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
If I read this right, the AUC is constant:

         1       1000 0.50 95.6
         2       1000 0.50 93.9
         3       1000 0.50 92.4
         4       1000 0.50 91.1
         5       1000 0.50 87.3
         6       1000 0.50 86.4
         7       1000 0.50 85.5
         8       1000 0.50 84.6
         9       1000 0.50 83.8
                          0.50 83.1 (final)

Where do I go from here? Just run one iteration? Wait for more data?

On Fri, Apr 29, 2011 at 12:39 PM, Ted Dunning <te...@gmail.com> wrote:
> Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of me,
> but I haven't had time to drill in on it.
>
> With RCV1, however, the AUC stayed constant and high.  AUC is what the
> evolutionary algorithm is fighting for while percent correct is only for a
> single threshold (0.5 for the binary case).  With asymmetric class rates,
> that threshold might be sub-optimal.  AUC doesn't use a threshold so that
> won't be an issue with it.  It is pretty easy to make the evo algorithm use
> percent-correct instead of AUC.
>
> Regarding the over-fitting, these accuracies are on-line estimates being
> reported on held-out data so it should be a reasonable estimate of error.
>  With a time-based train/test split, test performance will probably be a bit
> lower than the estimate.
>
> The held-out data is formed by doing cross validation on the fly.  Each
> CrossFoldLearner inside the evolutionary algorithm maintains 5 online
> learning algorithms each of which gets a different split of training and
> test data.  This means that we get an out-of-sample estimate of performance
> every time we add a training sample.
>
> On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <bi...@gmail.com>wrote:
>
>> After the first pass, the model hasn't trained yet. After the second,
>> accuracy is 95.6%, and then it drifts gracefully downward with each
>> additional iteration, landing at .83.
>>
>> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
>> but this pattern is not intuitive to me.
>>
>

Re: Logistic Regression Tutorial

Posted by Ted Dunning <te...@gmail.com>.
Yeah... I saw this in weaker form with RCV1.  It bugs the hell out of me,
but I haven't had time to drill in on it.

With RCV1, however, the AUC stayed constant and high.  AUC is what the
evolutionary algorithm is fighting for while percent correct is only for a
single threshold (0.5 for the binary case).  With asymmetric class rates,
that threshold might be sub-optimal.  AUC doesn't use a threshold so that
won't be an issue with it.  It is pretty easy to make the evo algorithm use
percent-correct instead of AUC.

Regarding the over-fitting, these accuracies are on-line estimates being
reported on held-out data so it should be a reasonable estimate of error.
 With a time-based train/test split, test performance will probably be a bit
lower than the estimate.

The held-out data is formed by doing cross validation on the fly.  Each
CrossFoldLearner inside the evolutionary algorithm maintains 5 online
learning algorithms each of which gets a different split of training and
test data.  This means that we get an out-of-sample estimate of performance
every time we add a training sample.
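
For reference, a sketch of how those held-out numbers can be read back while
training (target and vector stand for one example's category index and encoded
feature vector; the getBest()/getPayload()/getLearner() chain and the auc()
and percentCorrect() calls are assumptions based on the Mahout SGD examples of
this period, so treat the exact method names as approximate):

    learningAlgorithm.train(target, vector);   // one example at a time
    State<AdaptiveLogisticRegression.Wrapper, CrossFoldLearner> best =
        learningAlgorithm.getBest();
    if (best != null) {                        // null until enough data has been seen
      CrossFoldLearner learner = best.getPayload().getLearner();
      System.out.printf("auc=%.2f  correct=%.1f%%%n",
          learner.auc(), 100 * learner.percentCorrect());
    }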

On Fri, Apr 29, 2011 at 4:36 AM, Benson Margulies <bi...@gmail.com>wrote:

> After the first pass, the model hasn't trained yet. After the second,
> accuracy is 95.6%, and then it drifts gracefully downward with each
> additional iteration, landing at .83.
>
> I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
> but this pattern is not intuitive to me.
>

Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
With some help from Ted (which I plan to turn into a checked-in tool
if he doesn't get there first), I'm running LR on my initial small
example.

I adapted Ted's rcv1 sample to digest a directory containing
subdirectories containing exemplars.

Ted's delightfully small program pushes all of the data into the model
'n' times (n is 10 in my current variation). It displays the best
learner's accuracy at each iteration.

The example is 1000 docs in 10 categories.

With 20k features, I note that the accuracy scores get worse on each
iteration of pushing the data into the model.

After the first pass, the model hasn't trained yet. After the second,
accuracy is 95.6%, and then it drifts gracefully downward with each
additional iteration, landing at .83.

I'm puzzled; I'm accustomed to overfitting causing scores to inflate,
but this pattern is not intuitive to me.

On Thu, Apr 28, 2011 at 5:59 PM, Ted Dunning <te...@gmail.com> wrote:
> And, of course, the current SGD learner handles the multinomial case.
>
> On Thu, Apr 28, 2011 at 2:52 PM, Mike Nute <mi...@gmail.com> wrote:
>
>> Once you do the vectorization, that becomes the feature vector for your
>> GLM.  The problem with doing multinomial logit is that if you have a
>> feature
>> vector of size K and N different categories, you end up with K*(N-1)
>> separate parameters to fit which can be nasty, though there are ways to get
>> around that by constraining them.  The N-way case is equivalent to doing
>> (N-1) separate binomial logits.
>>
>> Does that help with the connection between the vectorization process and
>> LR?
>>
>> MN
>>
>> On Thu, Apr 28, 2011 at 5:07 PM, Benson Margulies <bimargulies@gmail.com
>> >wrote:
>>
>> > Thanks, all. I get frustrated really fast when trying to read a PDF.
>> > I guess I'm a fossil.
>> >
>> > On Thu, Apr 28, 2011 at 4:54 PM, Ted Dunning <te...@gmail.com>
>> > wrote:
>> > > The TrainNewsGroups class does this not quite as nicely as is possible
>> > (it
>> > > avoids the TextValueEncoder).
>> > >
>> > > I will post a simplified example on github that I just worked up for
>> > RCV1.
>> > >
>> > >
>> > >
>> > > On Thu, Apr 28, 2011 at 1:32 PM, Chris Schilling <ch...@cellixis.com>
>> > wrote:
>> > >
>> > >> Benson,
>> > >>
>> > >> Chapter 14 and 15 discuss the 20 newsgroups classification example
>> using
>> > >> bag-of-words.  In this implementation of LR, you have to manually
>> create
>> > the
>> > >> feature vectors when iterating through the files.  The features are
>> > hashed
>> > >> into a vector of predetermined length.  The examples are very clear
>> and
>> > easy
>> > >> to setup.  I can send you some code I wrote for a similar problem if
>> it
>> > will
>> > >> help.
>> > >>
>> > >> Chris
>> > >>
>> > >> On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:
>> > >>
>> > >> > Chris,
>> > >> >
>> > >> > I'm looking at a recently-purchased MIA.
>> > >> >
>> > >> > The LR example is all about the donut file, which has features that
>> > >> > don't look anything like, even remotely, a full-up bag-of-words
>> > >> > vector.
>> > >> >
>> > >> > I'm lacking the point of connection between the vectorization
>> process
>> > >> > (which we have some experience here with running canopy/kmeans) and
>> > >> > the LR example. It's probably some simple principle that I'm failing
>> > >> > to grasp.
>> > >> >
>> > >> > --benson
>> > >> >
>> > >> >
>> > >> > On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <
>> chris@cellixis.com>
>> > >> wrote:
>> > >> >> Benson,
>> > >> >>
>> > >> >> The latest chapters in Mahout in Action cover document
>> classification
>> > >> using LR very well.
>> > >> >>
>> > >> >> Chris
>> > >> >>
>> > >> >>
>> > >> >> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
>> > >> >>
>> > >> >>> Mike,
>> > >> >>>
>> > >> >>> in the time available for the experiment I want to perform, all I
>> > can
>> > >> >>> imagine doing is turning each document into a bag-of-words feature
>> > >> >>> vector. So, I want to run the pipeline of lucene->vectors->... and
>> > >> >>> train a model. I confess that I don't have the time to try to
>> absorb
>> > >> >>> the underlying math, indeed, I have some co-workers who can help
>> me
>> > >> >>> with that. My problem is entirely plumbing at this point.
>> > >> >>>
>> > >> >>> --benson
>> > >> >>>
>> > >> >>>
>> > >> >>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com>
>> > >> wrote:
>> > >> >>>> Benson,
>> > >> >>>>
>> > >> >>>> Lecture 3 in this one is a good intro to the logit model:
>> > >> >>>>
>> > >> >>>>
>> > >>
>> >
>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>> > >> >>>>
>> > >> >>>> The lecture notes are pretty solid too so that might be faster.
>> > >> >>>>
>> > >> >>>> The short version: Logistic Regression is a GLM with the link
>> > f^-1(x)
>> > >> =
>> > >> >>>> 1/(1+e^(xB)) and a Binomial likelihood function.  You can
>> > >> alternatively use
>> > >> >>>> Batch or Stochastic Gradient Descent.
>> > >> >>>>
>> > >> >>>> I've never done document classification before though, so I'm not
>> > much
>> > >> help
>> > >> >>>> with more complicated things like choosing the feature vector.
>> > >> >>>>
>> > >> >>>> Good Luck,
>> > >> >>>> Mike Nute
>> > >> >>>>
>> > >> >>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <
>> > >> bimargulies@gmail.com>wrote:
>> > >> >>>>
>> > >> >>>>> Is there a logistic regression tutorial in the house? I've got a
>> > >> stack
>> > >> >>>>> of files (Arabic ones, no less) and I want to train and score a
>> > >> >>>>> classifier.
>> > >> >>>>>
>> > >> >>>>
>> > >> >>>>
>> > >> >>>>
>> > >> >>>> --
>> > >> >>>> Michael Nute
>> > >> >>>> Mike.Nute@gmail.com
>> > >> >>>>
>> > >> >>
>> > >> >>
>> > >>
>> > >>
>> > >
>> >
>>
>>
>>
>> --
>> Michael Nute
>> Mike.Nute@gmail.com
>>
>

Re: Logistic Regression Tutorial

Posted by Ted Dunning <te...@gmail.com>.
And, of course, the current SGD learner handles the multinomial case.
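
Concretely, the multinomial case is just the first constructor argument in the
SGD classes, so none of the plumbing changes (FEATURES here stands in for the
hashed vector size, as elsewhere in this thread):

    // A 10-way model rather than a binary one; train(...) then takes a
    // category index 0..9 as its target instead of 0/1.
    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(10, FEATURES, new L1());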

On Thu, Apr 28, 2011 at 2:52 PM, Mike Nute <mi...@gmail.com> wrote:

> Once you do the vectorization, that becomes the feature vector for your
> GLM.  The problem with doing multinomial logit is that if you have a
> feature
> vector of size K and N different categories, you end up with K*(N-1)
> separate parameters to fit which can be nasty, though there are ways to get
> around that by constraining them.  The N-way case is equivalent to doing
> (N-1) separate binomial logits.
>
> Does that help with the connection between the vectorization process and
> LR?
>
> MN
>
> On Thu, Apr 28, 2011 at 5:07 PM, Benson Margulies <bimargulies@gmail.com
> >wrote:
>
> > Thanks, all. I get frustrated really fast when trying to read a PDF.
> > I guess I'm a fossil.
> >
> > On Thu, Apr 28, 2011 at 4:54 PM, Ted Dunning <te...@gmail.com>
> > wrote:
> > > The TrainNewsGroups class does this not quite as nicely as is possible
> > (it
> > > avoids the TextValueEncoder).
> > >
> > > I will post a simplified example on github that I just worked up for
> > RCV1.
> > >
> > >
> > >
> > > On Thu, Apr 28, 2011 at 1:32 PM, Chris Schilling <ch...@cellixis.com>
> > wrote:
> > >
> > >> Benson,
> > >>
> > >> Chapter 14 and 15 discuss the 20 newsgroups classification example
> using
> > >> bag-of-words.  In this implementation of LR, you have to manually
> create
> > the
> > >> feature vectors when iterating through the files.  The features are
> > hashed
> > >> into a vector of predetermined length.  The examples are very clear
> and
> > easy
> > >> to setup.  I can send you some code I wrote for a similar problem if
> it
> > will
> > >> help.
> > >>
> > >> Chris
> > >>
> > >> On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:
> > >>
> > >> > Chris,
> > >> >
> > >> > I'm looking at a recently-purchased MIA.
> > >> >
> > >> > The LR example is all about the donut file, which has features that
> > >> > don't look anything like, even remotely, a full-up bag-of-words
> > >> > vector.
> > >> >
> > >> > I'm lacking the point of connection between the vectorization
> process
> > >> > (which we have some experience here with running canopy/kmeans) and
> > >> > the LR example. It's probably some simple principle that I'm failing
> > >> > to grasp.
> > >> >
> > >> > --benson
> > >> >
> > >> >
> > >> > On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <
> chris@cellixis.com>
> > >> wrote:
> > >> >> Benson,
> > >> >>
> > >> >> The latest chapters in Mahout in Action cover document
> classification
> > >> using LR very well.
> > >> >>
> > >> >> Chris
> > >> >>
> > >> >>
> > >> >> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
> > >> >>
> > >> >>> Mike,
> > >> >>>
> > >> >>> in the time available for the experiment I want to perform, all I
> > can
> > >> >>> imagine doing is turning each document into a bag-of-words feature
> > >> >>> vector. So, I want to run the pipeline of lucene->vectors->... and
> > >> >>> train a model. I confess that I don't have the time to try to
> absorb
> > >> >>> the underlying math, indeed, I have some co-workers who can help
> me
> > >> >>> with that. My problem is entirely plumbing at this point.
> > >> >>>
> > >> >>> --benson
> > >> >>>
> > >> >>>
> > >> >>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com>
> > >> wrote:
> > >> >>>> Benson,
> > >> >>>>
> > >> >>>> Lecture 3 in this one is a good intro to the logit model:
> > >> >>>>
> > >> >>>>
> > >>
> >
> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
> > >> >>>>
> > >> >>>> The lecture notes are pretty solid too so that might be faster.
> > >> >>>>
> > >> >>>> The short version: Logistic Regression is a GLM with the link
> > f^-1(x)
> > >> =
> > >> >>>> 1/(1+e^(xB)) and a Binomial likelihood function.  You can
> > >> alternatively use
> > >> >>>> Batch or Stochastic Gradient Descent.
> > >> >>>>
> > >> >>>> I've never done document classification before though, so I'm not
> > much
> > >> help
> > >> >>>> with more complicated things like choosing the feature vector.
> > >> >>>>
> > >> >>>> Good Luck,
> > >> >>>> Mike Nute
> > >> >>>>
> > >> >>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <
> > >> bimargulies@gmail.com>wrote:
> > >> >>>>
> > >> >>>>> Is there a logistic regression tutorial in the house? I've got a
> > >> stack
> > >> >>>>> of files (Arabic ones, no less) and I want to train and score a
> > >> >>>>> classifier.
> > >> >>>>>
> > >> >>>>
> > >> >>>>
> > >> >>>>
> > >> >>>> --
> > >> >>>> Michael Nute
> > >> >>>> Mike.Nute@gmail.com
> > >> >>>>
> > >> >>
> > >> >>
> > >>
> > >>
> > >
> >
>
>
>
> --
> Michael Nute
> Mike.Nute@gmail.com
>

Re: Logistic Regression Tutorial

Posted by Mike Nute <mi...@gmail.com>.
Once you do the vectorization, that becomes the feature vector for your
GLM.  The problem with doing multinomial logit is that if you have a feature
vector of size K and N different categories, you end up with K*(N-1)
separate parameters to fit which can be nasty, though there are ways to get
around that by constraining them.  The N-way case is equivalent to doing
(N-1) separate binomial logits.
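
For the numbers elsewhere in this thread (a 20,000-dimensional hashed feature
vector and 10 categories), that works out to 20,000 * 9 = 180,000 parameters.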

Does that help with the connection between the vectorization process and
LR?

MN

On Thu, Apr 28, 2011 at 5:07 PM, Benson Margulies <bi...@gmail.com>wrote:

> Thanks, all. I get frustrated really fast when trying to read a PDF.
> I guess I'm a fossil.
>
> On Thu, Apr 28, 2011 at 4:54 PM, Ted Dunning <te...@gmail.com>
> wrote:
> > The TrainNewsGroups class does this not quite as nicely as is possible
> (it
> > avoids the TextValueEncoder).
> >
> > I will post a simplified example on github that I just worked up for
> RCV1.
> >
> >
> >
> > On Thu, Apr 28, 2011 at 1:32 PM, Chris Schilling <ch...@cellixis.com>
> wrote:
> >
> >> Benson,
> >>
> >> Chapter 14 and 15 discuss the 20 newsgroups classification example using
> >> bag-of-words.  In this implementation of LR, you have to manually create
> the
> >> feature vectors when iterating through the files.  The features are
> hashed
> >> into a vector of predetermined length.  The examples are very clear and
> easy
> >> to setup.  I can send you some code I wrote for a similar problem if it
> will
> >> help.
> >>
> >> Chris
> >>
> >> On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:
> >>
> >> > Chris,
> >> >
> >> > I'm looking at a recently-purchased MIA.
> >> >
> >> > The LR example is all about the donut file, which has features that
> >> > don't look anything like, even remotely, a full-up bag-of-words
> >> > vector.
> >> >
> >> > I'm lacking the point of connection between the vectorization process
> >> > (which we have some experience here with running canopy/kmeans) and
> >> > the LR example. It's probably some simple principle that I'm failing
> >> > to grasp.
> >> >
> >> > --benson
> >> >
> >> >
> >> > On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <ch...@cellixis.com>
> >> wrote:
> >> >> Benson,
> >> >>
> >> >> The latest chapters in Mahout in Action cover document classification
> >> using LR very well.
> >> >>
> >> >> Chris
> >> >>
> >> >>
> >> >> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
> >> >>
> >> >>> Mike,
> >> >>>
> >> >>> in the time available for the experiment I want to perform, all I
> can
> >> >>> imagine doing is turning each document into a bag-of-words feature
> >> >>> vector. So, I want to run the pipeline of lucene->vectors->... and
> >> >>> train a model. I confess that I don't have the time to try to absorb
> >> >>> the underlying math, indeed, I have some co-workers who can help me
> >> >>> with that. My problem is entirely plumbing at this point.
> >> >>>
> >> >>> --benson
> >> >>>
> >> >>>
> >> >>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com>
> >> wrote:
> >> >>>> Benson,
> >> >>>>
> >> >>>> Lecture 3 in this one is a good intro to the logit model:
> >> >>>>
> >> >>>>
> >>
> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
> >> >>>>
> >> >>>> The lecture notes are pretty solid too so that might be faster.
> >> >>>>
> >> >>>> The short version: Logistic Regression is a GLM with the link
> f^-1(x)
> >> =
> >> >>>> 1/(1+e^(xB)) and a Binomial likelihood function.  You can
> >> alternatively use
> >> >>>> Batch or Stochastic Gradient Descent.
> >> >>>>
> >> >>>> I've never done document classification before though, so I'm not
> much
> >> help
> >> >>>> with more complicated things like choosing the feature vector.
> >> >>>>
> >> >>>> Good Luck,
> >> >>>> Mike Nute
> >> >>>>
> >> >>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <
> >> bimargulies@gmail.com>wrote:
> >> >>>>
> >> >>>>> Is there a logistic regression tutorial in the house? I've got a
> >> stack
> >> >>>>> of files (Arabic ones, no less) and I want to train and score a
> >> >>>>> classifier.
> >> >>>>>
> >> >>>>
> >> >>>>
> >> >>>>
> >> >>>> --
> >> >>>> Michael Nute
> >> >>>> Mike.Nute@gmail.com
> >> >>>>
> >> >>
> >> >>
> >>
> >>
> >
>



-- 
Michael Nute
Mike.Nute@gmail.com

Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
Thanks, all. I get frustrated really fast when trying to read a PDF.
I guess I'm a fossil.

On Thu, Apr 28, 2011 at 4:54 PM, Ted Dunning <te...@gmail.com> wrote:
> The TrainNewsGroups class does this not quite as nicely as is possible (it
> avoids the TextValueEncoder).
>
> I will post a simplified example on github that I just worked up for RCV1.
>
>
>
> On Thu, Apr 28, 2011 at 1:32 PM, Chris Schilling <ch...@cellixis.com> wrote:
>
>> Benson,
>>
>> Chapter 14 and 15 discuss the 20 newsgroups classification example using
>> bag-of-words.  In this implementation of LR, you have to manually create the
>> feature vectors when iterating through the files.  The features are hashed
>> into a vector of predetermined length.  The examples are very clear and easy
>> to setup.  I can send you some code I wrote for a similar problem if it will
>> help.
>>
>> Chris
>>
>> On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:
>>
>> > Chris,
>> >
>> > I'm looking at a recently-purchased MIA.
>> >
>> > The LR example is all about the donut file, which has features that
>> > don't look anything like, even remotely, a full-up bag-of-words
>> > vector.
>> >
>> > I'm lacking the point of connection between the vectorization process
>> > (which we have some experience here with running canopy/kmeans) and
>> > the LR example. It's probably some simple principle that I'm failing
>> > to grasp.
>> >
>> > --benson
>> >
>> >
>> > On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <ch...@cellixis.com>
>> wrote:
>> >> Benson,
>> >>
>> >> The latest chapters in Mahout in Action cover document classification
>> using LR very well.
>> >>
>> >> Chris
>> >>
>> >>
>> >> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
>> >>
>> >>> Mike,
>> >>>
>> >>> in the time available for the experiment I want to perform, all I can
>> >>> imagine doing is turning each document into a bag-of-words feature
>> >>> vector. So, I want to run the pipeline of lucene->vectors->... and
>> >>> train a model. I confess that I don't have the time to try to absorb
>> >>> the underlying math, indeed, I have some co-workers who can help me
>> >>> with that. My problem is entirely plumbing at this point.
>> >>>
>> >>> --benson
>> >>>
>> >>>
>> >>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com>
>> wrote:
>> >>>> Benson,
>> >>>>
>> >>>> Lecture 3 in this one is a good intro to the logit model:
>> >>>>
>> >>>>
>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>> >>>>
>> >>>> The lecture notes are pretty solid too so that might be faster.
>> >>>>
>> >>>> The short version: Logistic Regression is a GLM with the link f^-1(x)
>> =
>> >>>> 1/(1+e^(xB)) and a Binomial likelihood function.  You can
>> alternatively use
>> >>>> Batch or Stochastic Gradient Descent.
>> >>>>
>> >>>> I've never done document classification before though, so I'm not much
>> help
>> >>>> with more complicated things like choosing the feature vector.
>> >>>>
>> >>>> Good Luck,
>> >>>> Mike Nute
>> >>>>
>> >>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <
>> bimargulies@gmail.com>wrote:
>> >>>>
>> >>>>> Is there a logistic regression tutorial in the house? I've got a
>> stack
>> >>>>> of files (Arabic ones, no less) and I want to train and score a
>> >>>>> classifier.
>> >>>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Michael Nute
>> >>>> Mike.Nute@gmail.com
>> >>>>
>> >>
>> >>
>>
>>
>

Re: Logistic Regression Tutorial

Posted by Ted Dunning <te...@gmail.com>.
The TrainNewsGroups class does this not quite as nicely as is possible (it
avoids the TextValueEncoder).

I will post a simplified example on github that I just worked up for RCV1.
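
A rough sketch of the TrainNewsGroups-style encoding described above: each word
is hashed into a fixed-length vector with StaticWordValueEncoder, plus a
constant intercept term.  FEATURES and documentText are placeholders, the
whitespace tokenization is deliberately naive, and the package names are the
ones used in later Mahout releases, so adjust to whatever the tree of this era
actually uses:

    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;
    import org.apache.mahout.vectorizer.encoders.ConstantValueEncoder;
    import org.apache.mahout.vectorizer.encoders.StaticWordValueEncoder;

    StaticWordValueEncoder wordEncoder = new StaticWordValueEncoder("words");
    ConstantValueEncoder interceptEncoder = new ConstantValueEncoder("intercept");

    // One document -> one hashed feature vector of fixed size FEATURES.
    Vector v = new RandomAccessSparseVector(FEATURES);
    interceptEncoder.addToVector("1", v);          // bias term
    for (String word : documentText.toLowerCase().split("\\s+")) {
      wordEncoder.addToVector(word, v);            // each word hashed into [0, FEATURES)
    }
    // v is then what gets passed to train(category, v) on the learner.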



On Thu, Apr 28, 2011 at 1:32 PM, Chris Schilling <ch...@cellixis.com> wrote:

> Benson,
>
> Chapter 14 and 15 discuss the 20 newsgroups classification example using
> bag-of-words.  In this implementation of LR, you have to manually create the
> feature vectors when iterating through the files.  The features are hashed
> into a vector of predetermined length.  The examples are very clear and easy
> to setup.  I can send you some code I wrote for a similar problem if it will
> help.
>
> Chris
>
> On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:
>
> > Chris,
> >
> > I'm looking at a recently-purchased MIA.
> >
> > The LR example is all about the donut file, which has features that
> > don't look anything like, even remotely, a full-up bag-of-words
> > vector.
> >
> > I'm lacking the point of connection between the vectorization process
> > (which we have some experience here with running canopy/kmeans) and
> > the LR example. It's probably some simple principle that I'm failing
> > to grasp.
> >
> > --benson
> >
> >
> > On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <ch...@cellixis.com>
> wrote:
> >> Benson,
> >>
> >> The latest chapters in Mahout in Action cover document classification
> using LR very well.
> >>
> >> Chris
> >>
> >>
> >> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
> >>
> >>> Mike,
> >>>
> >>> in the time available for the experiment I want to perform, all I can
> >>> imagine doing is turning each document into a bag-of-words feature
> >>> vector. So, I want to run the pipeline of lucene->vectors->... and
> >>> train a model. I confess that I don't have the time to try to absorb
> >>> the underlying math, indeed, I have some co-workers who can help me
> >>> with that. My problem is entirely plumbing at this point.
> >>>
> >>> --benson
> >>>
> >>>
> >>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com>
> wrote:
> >>>> Benson,
> >>>>
> >>>> Lecture 3 in this one is a good intro to the logit model:
> >>>>
> >>>>
> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
> >>>>
> >>>> The lecture notes are pretty solid too so that might be faster.
> >>>>
> >>>> The short version: Logistic Regression is a GLM with the link f^-1(x)
> =
> >>>> 1/(1+e^(xB)) and a Binomial likelihood function.  You can
> alternatively use
> >>>> Batch or Stochastic Gradient Descent.
> >>>>
> >>>> I've never done document classification before though, so I'm not much
> help
> >>>> with more complicated things like choosing the feature vector.
> >>>>
> >>>> Good Luck,
> >>>> Mike Nute
> >>>>
> >>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <
> bimargulies@gmail.com>wrote:
> >>>>
> >>>>> Is there a logistic regression tutorial in the house? I've got a
> stack
> >>>>> of files (Arabic ones, no less) and I want to train and score a
> >>>>> classifier.
> >>>>>
> >>>>
> >>>>
> >>>>
> >>>> --
> >>>> Michael Nute
> >>>> Mike.Nute@gmail.com
> >>>>
> >>
> >>
>
>

Re: Logistic Regression Tutorial

Posted by Chris Schilling <ch...@cellixis.com>.
Benson,

Chapters 14 and 15 discuss the 20 newsgroups classification example using bag-of-words.  In this implementation of LR, you have to manually create the feature vectors while iterating through the files.  The features are hashed into a vector of predetermined length.  The examples are very clear and easy to set up.  I can send you some code I wrote for a similar problem if it will help.

Chris
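
Stripped of the Mahout classes, "hashed into a vector of predetermined length"
is just the hashing trick: a word's hash picks a slot in a fixed-size array and
collisions are simply tolerated.  A minimal, library-free sketch (the
whitespace tokenization and the counting are illustrative only):

    // Minimal feature hashing: map each word to a slot in a fixed-length vector.
    static double[] hashEncode(String text, int cardinality) {
      double[] v = new double[cardinality];
      for (String word : text.toLowerCase().split("\\s+")) {
        int slot = (word.hashCode() & Integer.MAX_VALUE) % cardinality;
        v[slot] += 1.0;   // bag-of-words count landing in the hashed slot
      }
      return v;
    }

Mahout's encoders do essentially this, with extras such as multiple hash probes
per feature and per-feature weighting.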

On Apr 28, 2011, at 1:24 PM, Benson Margulies wrote:

> Chris,
> 
> I'm looking at a recently-purchased MIA.
> 
> The LR example is all about the donut file, which has features that
> don't look anything like, even remotely, a full-up bag-of-words
> vector.
> 
> I'm lacking the point of connection between the vectorization process
> (which we have some experience here with running canopy/kmeans) and
> the LR example. It's probably some simple principle that I'm failing
> to grasp.
> 
> --benson
> 
> 
> On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <ch...@cellixis.com> wrote:
>> Benson,
>> 
>> The latest chapters in Mahout in Action cover document classification using LR very well.
>> 
>> Chris
>> 
>> 
>> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
>> 
>>> Mike,
>>> 
>>> in the time available for the experiment I want to perform, all I can
>>> imagine doing is turning each document into a bag-of-words feature
>>> vector. So, I want to run the pipeline of lucene->vectors->... and
>>> train a model. I confess that I don't have the time to try to absorb
>>> the underlying math, indeed, I have some co-workers who can help me
>>> with that. My problem is entirely plumbing at this point.
>>> 
>>> --benson
>>> 
>>> 
>>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com> wrote:
>>>> Benson,
>>>> 
>>>> Lecture 3 in this one is a good intro to the logit model:
>>>> 
>>>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>>>> 
>>>> The lecture notes are pretty solid too so that might be faster.
>>>> 
>>>> The short version: Logistic Regression is a GLM with the link f^-1(x) =
>>>> 1/(1+e^(xB)) and a Binomial likelihood function.  You can alternatively use
>>>> Batch or Stochastic Gradient Descent.
>>>> 
>>>> I've never done document classification before though, so I'm not much help
>>>> with more complicated things like choosing the feature vector.
>>>> 
>>>> Good Luck,
>>>> Mike Nute
>>>> 
>>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bi...@gmail.com>wrote:
>>>> 
>>>>> Is there a logistic regression tutorial in the house? I've got a stack
>>>>> of files (Arabic ones, no less) and I want to train and score a
>>>>> classifier.
>>>>> 
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Michael Nute
>>>> Mike.Nute@gmail.com
>>>> 
>> 
>> 


Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
Chris,

I'm looking at a recently-purchased MIA.

The LR example is all about the donut file, which has features that
don't look even remotely like a full-up bag-of-words vector.

I'm missing the point of connection between the vectorization process
(with which we have some experience here, from running canopy/kmeans) and
the LR example. It's probably some simple principle that I'm failing
to grasp.

--benson


On Thu, Apr 28, 2011 at 4:02 PM, Chris Schilling <ch...@cellixis.com> wrote:
> Benson,
>
> The latest chapters in Mahout in Action cover document classification using LR very well.
>
> Chris
>
>
> On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:
>
>> Mike,
>>
>> in the time available for the experiment I want to perform, all I can
>> imagine doing is turning each document into a bag-of-words feature
>> vector. So, I want to run the pipeline of lucene->vectors->... and
>> train a model. I confess that I don't have the time to try to absorb
>> the underlying math; indeed, I have some co-workers who can help me
>> with that. My problem is entirely plumbing at this point.
>>
>> --benson
>>
>>
>> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com> wrote:
>>> Benson,
>>>
>>> Lecture 3 in this one is a good intro to the logit model:
>>>
>>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>>>
>>> The lecture notes are pretty solid too so that might be faster.
>>>
>>> The short version: Logistic Regression is a GLM with the inverse link
>>> f^-1(xB) = 1/(1+e^(-xB)) and a Binomial likelihood function.  You can fit it
>>> by iteratively reweighted least squares, or alternatively use Batch or
>>> Stochastic Gradient Descent.
>>>
>>> I've never done document classification before though, so I'm not much help
>>> with more complicated things like choosing the feature vector.
>>>
>>> Good Luck,
>>> Mike Nute
>>>
>>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bi...@gmail.com>wrote:
>>>
>>>> Is there a logistic regression tutorial in the house? I've got a stack
>>>> of files (Arabic ones, no less) and I want to train and score a
>>>> classifier.
>>>>
>>>
>>>
>>>
>>> --
>>> Michael Nute
>>> Mike.Nute@gmail.com
>>>
>
>

Re: Logistic Regression Tutorial

Posted by Chris Schilling <ch...@cellixis.com>.
Benson,

The latest chapters in Mahout in Action cover document classification using LR very well.  

Chris


On Apr 28, 2011, at 12:55 PM, Benson Margulies wrote:

> Mike,
> 
> in the time available for the experiment I want to perform, all I can
> imagine doing is turning each document into a bag-of-words feature
> vector. So, I want to run the pipeline of lucene->vectors->... and
> train a model. I confess that I don't have the time to try to absorb
> the underlying math; indeed, I have some co-workers who can help me
> with that. My problem is entirely plumbing at this point.
> 
> --benson
> 
> 
> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com> wrote:
>> Benson,
>> 
>> Lecture 3 in this one is a good intro to the logit model:
>> 
>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>> 
>> The lecture notes are pretty solid too so that might be faster.
>> 
>> The short version: Logistic Regression is a GLM with the inverse link
>> f^-1(xB) = 1/(1+e^(-xB)) and a Binomial likelihood function.  You can fit it
>> by iteratively reweighted least squares, or alternatively use Batch or
>> Stochastic Gradient Descent.
>> 
>> I've never done document classification before though, so I'm not much help
>> with more complicated things like choosing the feature vector.
>> 
>> Good Luck,
>> Mike Nute
>> 
>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bi...@gmail.com>wrote:
>> 
>>> Is there a logistic regression tutorial in the house? I've got a stack
>>> of files (Arabic ones, no less) and I want to train and score a
>>> classifier.
>>> 
>> 
>> 
>> 
>> --
>> Michael Nute
>> Mike.Nute@gmail.com
>> 


Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
I'm hoping for an N-way classifier, supervised.

On Thu, Apr 28, 2011 at 4:07 PM, Mike Nute <mi...@gmail.com> wrote:
> Oh, gotcha. Are you doing binary classification, and will it be supervised learning? If so, and the data set is large enough, that ought to work well enough and be easy. You could run PCA or one of the other dimension-reduction techniques if that is too large a feature vector.
>
> Just off the top of my head...
>
> Mike
> -----Original Message-----
> From: Benson Margulies <bi...@gmail.com>
> Date: Thu, 28 Apr 2011 15:55:14
> To: <us...@mahout.apache.org>
> Reply-To: user@mahout.apache.org
> Subject: Re: Logistic Regression Tutorial
>
> Mike,
>
> in the time available for the experiment I want to perform, all I can
> imagine doing is turning each document into a bag-of-words feature
> vector. So, I want to run the pipeline of lucene->vectors->... and
> train a model. I confess that I don't have the time to try to absorb
> the underlying math; indeed, I have some co-workers who can help me
> with that. My problem is entirely plumbing at this point.
>
> --benson
>
>
> On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com> wrote:
>> Benson,
>>
>> Lecture 3 in this one is a good intro to the logit model:
>>
>> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>>
>> The lecture notes are pretty solid too so that might be faster.
>>
>> The short version: Logistic Regression is a GLM with the inverse link
>> f^-1(xB) = 1/(1+e^(-xB)) and a Binomial likelihood function.  You can fit it
>> by iteratively reweighted least squares, or alternatively use Batch or
>> Stochastic Gradient Descent.
>>
>> I've never done document classification before though, so I'm not much help
>> with more complicated things like choosing the feature vector.
>>
>> Good Luck,
>> Mike Nute
>>
>> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bi...@gmail.com>wrote:
>>
>>> Is there a logistic regression tutorial in the house? I've got a stack
>>> of files (Arabic ones, no less) and I want to train and score a
>>> classifier.
>>>
>>
>>
>>
>> --
>> Michael Nute
>> Mike.Nute@gmail.com
>>
>

Re: Logistic Regression Tutorial

Posted by Mike Nute <mi...@gmail.com>.
Oh, gotcha. Are you doing binary classification, and will it be supervised learning? If so, and the data set is large enough, that ought to work well enough and be easy. You could run PCA or one of the other dimension-reduction techniques if that is too large a feature vector.

Just off the top of my head...

Mike
-----Original Message-----
From: Benson Margulies <bi...@gmail.com>
Date: Thu, 28 Apr 2011 15:55:14 
To: <us...@mahout.apache.org>
Reply-To: user@mahout.apache.org
Subject: Re: Logistic Regression Tutorial

Mike,

in the time available for the experiment I want to perform, all I can
imagine doing is turning each document into a bag-of-words feature
vector. So, I want to run the pipeline of lucene->vectors->... and
train a model. I confess that I don't have the time to try to absorb
the underlying math; indeed, I have some co-workers who can help me
with that. My problem is entirely plumbing at this point.

--benson


On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com> wrote:
> Benson,
>
> Lecture 3 in this one is a good intro to the logit model:
>
> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>
> The lecture notes are pretty solid too so that might be faster.
>
> The short version: Logistic Regression is a GLM with the inverse link
> f^-1(xB) = 1/(1+e^(-xB)) and a Binomial likelihood function.  You can fit it
> by iteratively reweighted least squares, or alternatively use Batch or
> Stochastic Gradient Descent.
>
> I've never done document classification before though, so I'm not much help
> with more complicated things like choosing the feature vector.
>
> Good Luck,
> Mike Nute
>
> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bi...@gmail.com>wrote:
>
>> Is there a logistic regression tutorial in the house? I've got a stack
>> of files (Arabic ones, no less) and I want to train and score a
>> classifier.
>>
>
>
>
> --
> Michael Nute
> Mike.Nute@gmail.com
>

Re: Logistic Regression Tutorial

Posted by Benson Margulies <bi...@gmail.com>.
Mike,

in the time available for the experiment I want to perform, all I can
imagine doing is turning each document into a bag-of-words feature
vector. So, I want to run the pipeline of lucene->vectors->... and
train a model. I confess that I don't have the time to try to absorb
the underlying math; indeed, I have some co-workers who can help me
with that. My problem is entirely plumbing at this point.
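
To make the plumbing concrete, here is a rough sketch of the training end of such a pipeline, assuming the vectorization step wrote SequenceFiles of Text keys and VectorWritable values and that Mahout's OnlineLogisticRegression is the learner; the path, feature count, category count, regularization, and label mapping below are placeholders, not a verified recipe:

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.classifier.sgd.L1;
import org.apache.mahout.classifier.sgd.OnlineLogisticRegression;
import org.apache.mahout.math.VectorWritable;

public class TrainFromVectors {
  private static final Map<String, Integer> LABELS = new HashMap<String, Integer>();

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder path: one part file produced by the vectorization step.
    Path vectors = new Path("tfidf-vectors/part-r-00000");
    int numFeatures = 20000;   // must match the cardinality of the stored vectors (assumption)
    int numCategories = 10;    // placeholder category count

    OnlineLogisticRegression learner =
        new OnlineLogisticRegression(numCategories, numFeatures, new L1()).lambda(1.0e-4);

    SequenceFile.Reader reader = new SequenceFile.Reader(FileSystem.get(conf), vectors, conf);
    Text key = new Text();
    VectorWritable value = new VectorWritable();
    while (reader.next(key, value)) {
      learner.train(labelFor(key.toString()), value.get());
    }
    reader.close();
  }

  // Placeholder: derive a 0-based label index from however the keys encode the category.
  private static int labelFor(String key) {
    String category = key.split("/")[1];
    if (!LABELS.containsKey(category)) {
      LABELS.put(category, LABELS.size());
    }
    return LABELS.get(category);
  }
}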

--benson


On Thu, Apr 28, 2011 at 3:52 PM, Mike Nute <mi...@gmail.com> wrote:
> Benson,
>
> Lecture 3 in this one is a good intro to the logit model:
>
> http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1
>
> The lecture notes are pretty solid too so that might be faster.
>
> The short version: Logistic Regression is a GLM with the inverse link
> f^-1(xB) = 1/(1+e^(-xB)) and a Binomial likelihood function.  You can fit it
> by iteratively reweighted least squares, or alternatively use Batch or
> Stochastic Gradient Descent.
>
> I've never done document classification before though, so I'm not much help
> with more complicated things like choosing the feature vector.
>
> Good Luck,
> Mike Nute
>
> On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bi...@gmail.com>wrote:
>
>> Is there a logistic regression tutorial in the house? I've got a stack
>> of files (Arabic ones, no less) and I want to train and score a
>> classifier.
>>
>
>
>
> --
> Michael Nute
> Mike.Nute@gmail.com
>

Re: Logistic Regression Tutorial

Posted by Mike Nute <mi...@gmail.com>.
Benson,

Lecture 3 in this one is a good intro to the logit model:

http://see.stanford.edu/see/lecturelist.aspx?coll=348ca38a-3a6d-4052-937d-cb017338d7b1

The lecture notes are pretty solid too so that might be faster.

The short version: Logistic Regression is a GLM with the inverse link
f^-1(xB) = 1/(1+e^(-xB)) and a Binomial likelihood function.  You can fit it
by iteratively reweighted least squares, or alternatively use Batch or
Stochastic Gradient Descent.
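
Spelled out, the stochastic-gradient update for that model is only a few lines. A self-contained illustration in plain Java (not Mahout's implementation):

// One SGD step for binomial logistic regression:
// p = 1 / (1 + exp(-x.B)), then B += rate * (y - p) * x.
public final class LogitSgdStep {
  public static void step(double[] beta, double[] x, int y, double rate) {
    double dot = 0.0;
    for (int i = 0; i < beta.length; i++) {
      dot += beta[i] * x[i];
    }
    double p = 1.0 / (1.0 + Math.exp(-dot));  // inverse link (logistic function)
    double error = y - p;                     // y is 0 or 1
    for (int i = 0; i < beta.length; i++) {
      beta[i] += rate * error * x[i];         // gradient of the Binomial log-likelihood
    }
  }
}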

I've never done document classification before though, so I'm not much help
with more complicated things like choosing the feature vector.

Good Luck,
Mike Nute

On Thu, Apr 28, 2011 at 3:35 PM, Benson Margulies <bi...@gmail.com>wrote:

> Is there a logistic regression tutorial in the house? I've got a stack
> of files (Arabic ones, no less) and I want to train and score a
> classifier.
>



-- 
Michael Nute
Mike.Nute@gmail.com