Posted to user@mahout.apache.org by Night Wolf <ni...@gmail.com> on 2011/11/19 07:43:41 UTC

Mahout: NB Model for Text Classification - In Sample Error

Hey all,

Quick question regarding a potential source of in-sample bias for a text
classification project. I'm developing a system that reads text messages
(i.e. SMS) and tries to classify them into a number of categories. We have
a few million messages. We built our training set from a window (~2
months) of messages randomly sampled from within this period.

We would like to classify messages within this two-month period and also
messages written after it. Does this raise any problems with sample bias?
From a set of over 600,000 SMSs we sampled only around 5,000 and manually
tagged them for sentiment. I can't see how this would skew our accuracy
results when our test set is taken from an unseen period after these two
months.

But I'd appreciate anyone's 2c on whether it's a mistake to a) classify
unseen SMS messages within this two-month sample period, or b) verify
accuracy based on a test set outside this two-month sample period.

Cheers,
/N

Re: Mahout: NB Model for Text Classification - In Sample Error

Posted by Ted Dunning <te...@gmail.com>.
Yes for active learning, no for transduction.

You can do semi-supervised clustering as well.  This is, in many ways, a
special case of transduction.

If you mean classify instead of cluster in your last sentence, then that is
definitely a way to do transduction.
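
As a rough sketch of the seeding idea: build one centroid per class from
the hand-tagged messages and let k-means assign the untagged ones.  This
uses scikit-learn and TF-IDF vectors rather than Mahout's k-means, and the
example texts and labels are made up:

    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import KMeans

    tagged_texts   = ["free ringtones now", "see you at lunch"]  # hand-tagged
    tagged_labels  = np.array([0, 1])                            # 0 = spam, 1 = ham
    untagged_texts = ["win a free prize", "lunch tomorrow?", "call me later"]

    vec = TfidfVectorizer()
    X_tagged   = vec.fit_transform(tagged_texts).toarray()
    X_untagged = vec.transform(untagged_texts).toarray()

    # One seed centroid per class: the mean of the tagged vectors of that class.
    classes = np.unique(tagged_labels)
    seeds = np.array([X_tagged[tagged_labels == c].mean(axis=0) for c in classes])

    km = KMeans(n_clusters=len(classes), init=seeds, n_init=1)
    untagged_clusters = km.fit_predict(X_untagged)
    # Cluster i starts out corresponding to class classes[i].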

On Mon, Nov 21, 2011 at 6:12 PM, Lance Norskog <go...@gmail.com> wrote:

> The active learning and transduction methods create candidates which you
> then check by hand, right? Would it work to cluster untagged items using
> the tagged items as seeds?
>
> Lance
>
> --
> Lance Norskog
> goksron@gmail.com
>

Re: Mahout: NB Model for Text Classification - In Sample Error

Posted by Lance Norskog <go...@gmail.com>.
The active learning and transduction methods create candidates which you
then check by hand, right? Would it work to cluster untagged items using
the tagged items as seeds?

Lance

On Fri, Nov 18, 2011 at 11:46 PM, Ted Dunning <te...@gmail.com> wrote:

> This test plan is pretty reasonable.  There is inherently going to be some
> form of bias due to the time shift, but the bias is real and will affect
> your test results the same way it will affect your operational accuracy.
>  It might be somewhat interesting to estimate the effect over time by also
> testing on a sample from within the same time period, but that is really
> mostly of academic interest.
>
> What I would recommend is that you use some additional techniques to
> increase your training set size.  Active learning is a classic technique
> which can help you build a relatively small training set that gives
> performance comparable to the performance you would get without
> down-sampling.  Transduction would let you use the untagged data to improve
> your model without increasing the number of tagged samples.
>
> One simple approach for active learning is to repeatedly take new training
> samples of untagged messages that are stratified on your first model's
> score.  In addition, it makes sense to also sample messages that have
> significant numbers of terms that do not appear in your positive training
> examples.  These methods are much simpler than doing active learning by the
> book, but give similar results.
>
> For transduction, a very simple method is to tag the rest of your data
> with your first model and then train a model on this larger training set.
> This helps because it effectively extends your model through cooccurrences
> with known terms.  Again, this is less effective than more formally
> defined transduction methods, but it can be surprisingly effective.
>
> Finally, I would recommend that you consider algorithms other than Naive
> Bayes for your basic model.  This is because you only have a small
> training set, and Naive Bayes depends in part on having a relatively
> large number of training examples in order to get a good model.
>



-- 
Lance Norskog
goksron@gmail.com

Re: Mahout: NB Model for Text Classification - In Sample Error

Posted by Ted Dunning <te...@gmail.com>.
This test plan is pretty reasonable.  There is inherently going to be some
form of bias due to the time shift, but the bias is real and will affect
your test results the same way it will affect your operational accuracy.
 It might be somewhat interesting to estimate the effect over time by also
testing on a sample from within the same time period, but that is really
mostly of academic interest.
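
If you do want to estimate that effect, a minimal sketch of the comparison
could look like this (scikit-learn's naive Bayes standing in for Mahout's;
the (text, label) layout and names are purely illustrative):

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.metrics import accuracy_score

    def train_nb(train_pairs):
        """Train a naive Bayes text classifier on (text, label) pairs."""
        texts, labels = zip(*train_pairs)
        vec = CountVectorizer()
        model = MultinomialNB().fit(vec.fit_transform(texts), labels)
        return vec, model

    def accuracy(vec, model, test_pairs):
        texts, labels = zip(*test_pairs)
        return accuracy_score(labels, model.predict(vec.transform(texts)))

    # in_window_train, in_window_holdout and after_window are lists of
    # (sms_text, label) pairs, split by message timestamp:
    # vec, model = train_nb(in_window_train)
    # print("in-window holdout:", accuracy(vec, model, in_window_holdout))
    # print("after the window :", accuracy(vec, model, after_window))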

What I would recommend is that you use some additional techniques to
increase your training set size.  Active learning is a classic technique
which can help you build a relatively small training set that gives
performance comparable to the performance you would get without
down-sampling.  Transduction would let you use the untagged data to improve
your model without increasing the number of tagged samples.

One simple approach for active learning is to repeatedly take new training
samples of untagged messages that are stratified on your first model's
score.  In addition, it makes sense to also sample messages that have
significant numbers of terms that do not appear in your positive training
examples.  These methods are much simpler than doing active learning by the
book, but give similar results.
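
To make those two sampling heuristics concrete, here is a rough pure-Python
sketch; score() stands in for whatever first model you trained, and the bin
count, batch sizes and term threshold are arbitrary illustrative choices:

    import random
    from collections import defaultdict

    def stratified_sample(untagged, score, bins=10, per_bin=50):
        """Bucket untagged messages by first-model score, then sample each bucket."""
        by_bin = defaultdict(list)
        for msg in untagged:
            b = min(int(score(msg) * bins), bins - 1)  # score assumed in [0, 1]
            by_bin[b].append(msg)
        picked = []
        for msgs in by_bin.values():
            picked.extend(random.sample(msgs, min(per_bin, len(msgs))))
        return picked

    def unseen_term_sample(untagged, positive_terms, min_unseen=3, n=200):
        """Messages with several terms never seen in the positive training examples."""
        known = set(positive_terms)
        candidates = [m for m in untagged
                      if len(set(m.lower().split()) - known) >= min_unseen]
        return random.sample(candidates, min(n, len(candidates)))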

For transduction, a very simple method is to tag the rest of your data
with your first model and then train a model on this larger training set.
This helps because it effectively extends your model through cooccurrences
with known terms.  Again, this is less effective than more formally
defined transduction methods, but it can be surprisingly effective.
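
As a sketch of that "tag the rest and retrain" step, with scikit-learn
standing in for Mahout and an arbitrary 0.9 confidence cutoff:

    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.naive_bayes import MultinomialNB

    def self_train(tagged_texts, tagged_labels, untagged_texts, cutoff=0.9):
        # First model from the hand-tagged messages.
        vec = CountVectorizer()
        first = MultinomialNB().fit(vec.fit_transform(tagged_texts), tagged_labels)

        # Auto-tag the untagged messages, keeping only confident predictions.
        U = vec.transform(untagged_texts)
        keep = first.predict_proba(U).max(axis=1) >= cutoff
        pseudo_texts = [t for t, k in zip(untagged_texts, keep) if k]
        pseudo_labels = list(first.predict(U)[keep])

        # Retrain on the enlarged set; cooccurrence with terms the first model
        # already knew is what lets the newly covered terms carry signal.
        all_texts = list(tagged_texts) + pseudo_texts
        all_labels = list(tagged_labels) + pseudo_labels
        vec2 = CountVectorizer()
        second = MultinomialNB().fit(vec2.fit_transform(all_texts), all_labels)
        return vec2, second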

Finally, I would recommend that you consider algorithms other than Naive
Bayes for your basic model.  This is because you only have a small
training set, and Naive Bayes depends in part on having a relatively
large number of training examples in order to get a good model.
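
A quick way to sanity-check that on the ~5,000 tagged messages is to
cross-validate naive Bayes against, say, a regularised logistic regression
on the same features.  A scikit-learn sketch (in Mahout terms you would
compare against the SGD logistic regression trainer instead):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.naive_bayes import MultinomialNB
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.model_selection import cross_val_score

    def compare(texts, labels):
        """Cross-validated accuracy of NB vs. logistic regression on TF-IDF."""
        candidates = [("naive bayes", MultinomialNB()),
                      ("logistic regression", LogisticRegression(max_iter=1000))]
        for name, clf in candidates:
            pipe = make_pipeline(TfidfVectorizer(), clf)
            print(name, cross_val_score(pipe, texts, labels, cv=5).mean())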


On Fri, Nov 18, 2011 at 10:43 PM, Night Wolf <ni...@gmail.com> wrote:

> Hey all,
>
> Quick question regarding a potential source of in-sample bias for a text
> classification project. I'm developing a system that reads text messages
> (i.e. SMS) and tries to classify them into a number of categories. We have
> a few million messages. We built our training set from a window (~2
> months) of messages randomly sampled from within this period.
>
> We would like to classify messages within this two-month period and also
> messages written after it. Does this raise any problems with sample bias?
> From a set of over 600,000 SMSs we sampled only around 5,000 and manually
> tagged them for sentiment. I can't see how this would skew our accuracy
> results when our test set is taken from an unseen period after these two
> months.
>
> But I'd appreciate anyone's 2c on whether it's a mistake to a) classify
> unseen SMS messages within this two-month sample period, or b) verify
> accuracy based on a test set outside this two-month sample period.
>
> Cheers,
> /N
>