Posted to users@opennlp.apache.org by Benedict Holland <be...@gmail.com> on 2018/10/02 17:28:33 UTC

Document Categorizer questions

Hello all,

I have a few questions about the document categorizer that reading the
manual didn't solve.

1. How many individual categories can I include in the training data?

2. Assume I have C categories. If I assume a document can have multiple
categories *c*, should I develop C separate models where the labels are
is_*c* and is_not_*c*? For example, assume I have a corpus of text from
pet store advertisements. Model 1 would have tags: is_about_cats and
is_not_about_cats. Model 2 would have tags: is_about_dogs and
is_not_about_dogs. Model 3 would have tags: is_about_birds and
is_not_about_birds. One could imagine an ad that is about cats and dogs
but not birds (for example).

3. When I use the model to estimate the category for a document, do I get a
probability for each of the categories?

4. Should I stem the text tokens or will the ME function handle that for
me?

5. How can I add features to the ME function, to test whether there are
features that the ME model does not currently include but that are
probably important? This might get into model development. I am not sure.
It is entirely possible that I missed that in the documentation.

Thank you so much!
~Ben

Re: Document Categorizer questions

Posted by Nikolai Krot <ta...@gmail.com>.
Hi Ben,

I have some experience with OpenNLP Doccat and can answer from that
experience.

I am using Naive Bayes; I have become convinced that it works better (for
me) than MaxEnt. My setup is that I do not have much training data.
http://www.ifp.illinois.edu/~iracohen/publications/precision-ecml04-ColorTR-final.pdf

I work with more than 2 classes, but only one class is assigned to a
document. Probabilities for all classes are available, but only the best
category is printed. Hack the code to get all categories with their
probabilities :)
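
Actually, you may not even need to hack anything: something like the
following sketch prints every category with its probability using the
DocumentCategorizerME API (the model file name and tokens here are just
placeholders):

import java.io.File;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class AllCategories {
    public static void main(String[] args) throws Exception {
        // Load a trained doccat model (placeholder file name).
        DoccatModel model = new DoccatModel(new File("en-doccat.bin"));
        DocumentCategorizerME categorizer = new DocumentCategorizerME(model);

        // Doccat expects pre-tokenized input; it only splits on whitespace.
        String[] tokens = {"cheap", "cat", "food", "on", "sale"};
        double[] outcomes = categorizer.categorize(tokens);

        // One probability per category, in the categorizer's index order.
        for (int i = 0; i < categorizer.getNumberOfCategories(); i++) {
            System.out.printf("%s = %.4f%n",
                categorizer.getCategory(i), outcomes[i]);
        }
        System.out.println("best: " + categorizer.getBestCategory(outcomes));
    }
}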

I always give it training texts that are pre-tokenized. Doccat itself
tokenizes by whitespace only.

Adding word bigrams usually helps to improve prediction quality.
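
Recent OpenNLP versions let you pass feature generators in at training
time, so bigrams do not require hacking the core. A rough sketch, assuming
a training file "train.txt" with one sample per line (the category,
followed by the whitespace-separated tokens of the document):

import java.io.BufferedOutputStream;
import java.io.File;
import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;
import opennlp.tools.doccat.BagOfWordsFeatureGenerator;
import opennlp.tools.doccat.DoccatFactory;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;
import opennlp.tools.doccat.DocumentSample;
import opennlp.tools.doccat.DocumentSampleStream;
import opennlp.tools.doccat.FeatureGenerator;
import opennlp.tools.doccat.NGramFeatureGenerator;
import opennlp.tools.util.InputStreamFactory;
import opennlp.tools.util.MarkableFileInputStreamFactory;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;
import opennlp.tools.util.TrainingParameters;

public class TrainWithBigrams {
    public static void main(String[] args) throws Exception {
        InputStreamFactory in =
            new MarkableFileInputStreamFactory(new File("train.txt"));
        ObjectStream<DocumentSample> samples = new DocumentSampleStream(
            new PlainTextByLineStream(in, StandardCharsets.UTF_8));

        // Unigram bag-of-words plus word bigrams as features.
        FeatureGenerator[] features = {
            new BagOfWordsFeatureGenerator(),
            new NGramFeatureGenerator(2, 2)
        };
        DoccatModel model = DocumentCategorizerME.train("en", samples,
            TrainingParameters.defaultParams(), new DoccatFactory(features));

        // Persist the trained model for later use.
        try (BufferedOutputStream out = new BufferedOutputStream(
                new FileOutputStream("en-doccat.bin"))) {
            model.serialize(out);
        }
    }
}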

Where I saw a very good boost in quality is feature selection. So far I
have only used chi-squared and want to try information gain (check the
paper mentioned above). I do it before training, with a set of additional
scripts.
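
In case it is useful: the chi-squared score for one term and one category
comes from a 2x2 contingency table of term presence vs. category
membership. A minimal sketch, with illustrative counts:

public final class ChiSquare {
    // n11: docs in the category containing the term
    // n10: docs in the category without the term
    // n01: docs outside the category containing the term
    // n00: docs outside the category without the term
    static double chiSquare(long n11, long n10, long n01, long n00) {
        double n = (double) n11 + n10 + n01 + n00;
        double diff = (double) n11 * n00 - (double) n10 * n01;
        double den = (double) (n11 + n01) * (n11 + n10)
                   * (n10 + n00) * (n01 + n00);
        return den == 0 ? 0.0 : n * diff * diff / den;
    }

    public static void main(String[] args) {
        // A term in 40 of 50 in-category docs but only 10 of 950
        // out-of-category docs scores highly (kept as a feature).
        System.out.println(chiSquare(40, 10, 10, 940));
    }
}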

best regards,
Nikolai

Re: Document Categorizer questions

Posted by Benedict Holland <be...@gmail.com>.
Hi All,

I say this because the softmax or logit model produces a probability of an
event or events occurring. Assuming they didn't use any mixing
distribution or anything fancy, we make the IIA assumption with logits.
Where I find this most powerful is the binary case. While we can
technically use it to model several events, each of the events is
independent, and we really care about the ratio between them all to take
out the denominator. Really, we care about the relative log odds,
sometimes. Again, it really depends on what is going on with the MaxEnt
function and how they are using it. If OpenNLP assumes 2 categories per
input document set, the result is exactly as you state. If they start to
assume multiple categories, particularly for the same text, then how we
interpret the classification probabilities would change.
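
To make the IIA point concrete: under a softmax over per-class scores
s_j(x), the odds between any two classes depend only on those two classes,
so all other alternatives drop out of the ratio:

\frac{P(y = a \mid x)}{P(y = b \mid x)}
  = \frac{\exp(s_a(x))}{\exp(s_b(x))}
  = \exp\bigl(s_a(x) - s_b(x)\bigr)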

So I guess, while this is a brute force method, if I have documents that
can belong to one or more of C categories, is a good solution to develop C
binary models? I believe the only explicit assumption is that the
categories do not overlap, which they obviously don't. I can easily do
this, even for thousands of separate models. It isn't that difficult. Oh
also, I really don't know how a logit model would handle probability
overlap given the IIA assumption. For example, even in the simple case of
A, B, and A&B outcomes, the assumption is violated. At least, I think this
is true.

I might be wrong about this. I come at NLP from a rather odd direction (I
do discrete choice modeling and CS).

Thanks,
~Ben

Re: Document Categorizer questions

Posted by Daniel Russ <dr...@apache.org>.
Hi Ben,

   I disagree with your assessment that it is a logit model and therefore is binary.  MaxEnt is more of a case where you are modeling a Baseline-Category Logit for nominal responses (see Agresti, An Introduction to Categorical Data Analysis, 2nd ed., Chapter 6.1). If you have a binary problem, this is exactly the log-odds.  The equations for the baseline-category logit model are exactly the same as in the GISModel for making predictions.
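
Concretely, for categories j = 1, ..., J with baseline category J, the baseline-category logit model is

\log\frac{\pi_j(x)}{\pi_J(x)} = \alpha_j + \beta_j^{\top} x
\quad\Longrightarrow\quad
\pi_j(x) = \frac{\exp(\alpha_j + \beta_j^{\top} x)}
                {\sum_{k=1}^{J} \exp(\alpha_k + \beta_k^{\top} x)}

which is exactly the softmax form a MaxEnt model evaluates; for J = 2 it collapses to the ordinary log-odds.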

   Nikolai makes an interesting comment that Naive Bayes works better for him.  That is interesting because discriminative classification methods TEND to work better than generative classification methods; for a nice discussion see https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf, specifically the stoplight problem.  However, Nikolai’s data may have some property that works really well with NB. One thing to remember is that the proof of the pudding is in the eating.
   
Daniel

Re: Document Categorizer questions

Posted by Benedict Holland <be...@gmail.com>.
Hi Daniel,

Yes. I am honestly not sure if multi-level classifiers make sense when
multiple binary classifiers are so easy. At the end of the day, these are
all likelihood estimates, and logit models model binary outcomes. It was
just strange that the documentation made it look like I could have a bunch
of tags on text in the same file, but unless OpenNLP is splitting those
out, I don't know how OpenNLP is managing it. Like, I wasn't sure if
OpenNLP was actually creating C binary classifiers based on the tags
themselves.

In most of OpenNLP's tooling, especially in the MaxEnt models, OpenNLP
typically accepts tokens, POS tags, and other metadata. For the document
classifier, it doesn't seem to work that way. It just seems to accept
tokens, and I can't quite figure out why. POS tags seem like something
important, unless this is simply running a token frequency analysis or an
LDA or something along those lines, which is fine, but then I don't quite
understand why a MaxEnt model would be required. Like, I would like to
know what this is actually doing with a set of tokens.

The deep learning stuff seems to all rely on brute force and seeing what
sticks. I like the MaxEnt models because they don't seem nearly as
arbitrary or black-box as a DNN. I assume that a DNN produces better
outcomes simply because it examines so many different possible independent
variables. Basically, it is a very elaborate model selection algorithm,
and once it produces outcomes, we can look into the independent variables
and wonder why they mattered.

Is it possible to modify or append data to the OpenNLP MaxEnt model
framework? I think I might have missed that.

Thanks,
~Ben

Re: Document Categorizer questions

Posted by Daniel Russ <dr...@apache.org>.
Hi Ben,

   It sounds like you want a multi-label classifier (which can give you more than 1 outcome class).  There are different ways of attacking the problem.  You can have multiple binary classifiers, one for each outcome class c (vs. not c).  There may be some overall normalization issues, but yes, it should be the (model) probability of being in a group. Since multi-label classification is not my speciality, I'm going to toss it to the rest of the group for other suggestions.
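
A minimal sketch of what I mean, using one binary Doccat model per label; the model file names, the label set, and the 0.5 threshold are all hypothetical:

import java.io.File;
import java.util.LinkedHashMap;
import java.util.Map;
import opennlp.tools.doccat.DoccatModel;
import opennlp.tools.doccat.DocumentCategorizerME;

public class OneVsRest {
    public static void main(String[] args) throws Exception {
        // One binary model per label, each trained with categories
        // "is_<label>" and "is_not_<label>" (hypothetical file names).
        Map<String, DoccatModel> models = new LinkedHashMap<>();
        for (String label : new String[] {"about_cats", "about_dogs", "about_birds"}) {
            models.put(label, new DoccatModel(new File("doccat-" + label + ".bin")));
        }

        String[] tokens = {"adopt", "a", "kitten", "and", "a", "puppy"};
        for (Map.Entry<String, DoccatModel> entry : models.entrySet()) {
            DocumentCategorizerME categorizer =
                new DocumentCategorizerME(entry.getValue());
            double[] outcomes = categorizer.categorize(tokens);
            // Probability of the positive category in this binary model.
            double p = outcomes[categorizer.getIndex("is_" + entry.getKey())];
            if (p >= 0.5) {
                System.out.printf("%s (p = %.3f)%n", entry.getKey(), p);
            }
        }
    }
}

Each model is queried independently, so a document can come out labeled with any subset of the C labels; the per-label probabilities are not normalized against each other.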

   Stemming may improve your results, but it may have no effect.  Test it and you’ll see.   

  Your question five is really interesting, and NLP application researchers struggle with this.  You are asking “what am I missing?” or “can I use my knowledge of the problem to improve the classifier?”  Sorry, but the answer is maybe.  This is why you need “big data” to train really good models.  Your model needs to see many, many different scenarios to learn how to adapt to the problem. Feature engineering is really difficult and not something people do well.  I hope you are starting to see why deep learning is beating other methodologies.  It can weigh many non-linear combinations of features to find the best set of features for classification (limited only by the features you supply). Deep learning is kind of like modeling the features.

Hope it helps
Daniel