Posted to users@opennlp.apache.org by "Joseph B. Ottinger" <jo...@autumncode.com> on 2017/07/09 13:38:03 UTC

NER training quality

I was planning to train my own model, but I wondered what kind of input
data would give the best results: does the training data have to make
sense, or be representative of common input? I have a dictionary of terms
to mark as entities, and while I have a good bit of sensible data, I need
to add entities to the model fairly often. Typically I'll have the entity
name and very little information to go with it, so it'd be easiest to use
something like a Markov chain generator to generate content around the
entity. I could also generate fairly static content, but I'd prefer to
train the system well, if possible.
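
Since a dictionary of entity names is already on hand, one zero-training baseline is OpenNLP's DictionaryNameFinder, which simply marks exact dictionary matches in tokenized text. A minimal sketch, assuming a recent OpenNLP (1.6+); the dictionary entries and the "project" type are placeholders, not anything from this thread:

    import java.util.Arrays;

    import opennlp.tools.dictionary.Dictionary;
    import opennlp.tools.namefind.DictionaryNameFinder;
    import opennlp.tools.util.Span;
    import opennlp.tools.util.StringList;

    public class DictionaryBaseline {
        public static void main(String[] args) {
            // Known entity names; a multi-token name goes in one StringList.
            Dictionary dict = new Dictionary();
            dict.put(new StringList("OpenNLP"));
            dict.put(new StringList("Apache", "Tomcat"));

            // Pure lookup: no model, no training.
            DictionaryNameFinder finder = new DictionaryNameFinder(dict, "project");

            String[] tokens = {"has", "anyone", "tried", "Apache", "Tomcat", "here", "?"};
            for (Span s : finder.find(tokens)) {
                System.out.println(s.getType() + ": " + String.join(" ",
                        Arrays.copyOfRange(tokens, s.getStart(), s.getEnd())));
            }
        }
    }

A lookup finder won't generalize to unseen names or ambiguous contexts the way a trained model can, which is what the rest of the thread is about.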

Re: NER training quality

Posted by Daniel Russ <da...@gmail.com>.
Hi Joseph,
If you already have IRC channel data, I would suggest using something like the brat annotation tool to annotate the entities you want the classifier to find.  It may take some time to accumulate enough training data, but it would be exactly the type of training data you want.  I think that if you chose to use a Markov chain, you would essentially be training a classifier to learn the parameters of your Markov chain.  I don’t want to discourage you from trying the Markov chain; it may work (please report back).  I remember hearing somewhere (in the context of neural networks) that synthetic data is useful for training, but not as useful as real data (maybe from Hinton’s Coursera course).  I think annotation is the more “standard” path people take.  I mention the brat annotator because OpenNLP can already handle data in that format.
Hope it works for you…
Daniel
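
For concreteness, once annotated data exists in OpenNLP's native name-finder format (brat annotations can be read through the opennlp.tools.formats.brat classes or converted), training might look roughly like the sketch below; the file names and the "project" entity type are placeholders, not anything from this thread:

    import java.io.File;
    import java.io.FileOutputStream;
    import java.io.OutputStream;
    import java.nio.charset.StandardCharsets;

    import opennlp.tools.namefind.NameFinderME;
    import opennlp.tools.namefind.NameSample;
    import opennlp.tools.namefind.NameSampleDataStream;
    import opennlp.tools.namefind.TokenNameFinderFactory;
    import opennlp.tools.namefind.TokenNameFinderModel;
    import opennlp.tools.util.MarkableFileInputStreamFactory;
    import opennlp.tools.util.ObjectStream;
    import opennlp.tools.util.PlainTextByLineStream;
    import opennlp.tools.util.TrainingParameters;

    public class TrainProjectFinder {
        public static void main(String[] args) throws Exception {
            // One sentence per line, entities marked inline, e.g.:
            //   has anyone deployed <START:project> Tomcat <END> behind httpd ?
            ObjectStream<String> lines = new PlainTextByLineStream(
                    new MarkableFileInputStreamFactory(new File("irc-train.txt")),
                    StandardCharsets.UTF_8);

            try (ObjectStream<NameSample> samples = new NameSampleDataStream(lines)) {
                TokenNameFinderModel model = NameFinderME.train(
                        "en", "project", samples,
                        TrainingParameters.defaultParams(),
                        new TokenNameFinderFactory());

                try (OutputStream out = new FileOutputStream("project-model.bin")) {
                    model.serialize(out);
                }
            }
        }
    }

At run time the saved model is loaded with new TokenNameFinderModel(new File("project-model.bin")) and applied through NameFinderME.find on tokenized text.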



Re: NER training quality

Posted by "Joseph B. Ottinger" <jo...@autumncode.com>.
*nod* Thanks. The NER will be applied to IRC channel traffic eventually, so
ideally we'd pull enough channel traffic to start identifying entities
(projects, really) accurately. The Markov chain idea sounds better and
better to me as an experiment: take IRC data, replace a few select tokens
with a placeholder, generate lots of input from the chain, and substitute
real entity names for the placeholders. We'll see how well that works as I
progress.
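
A rough sketch of that experiment, assuming whitespace tokenization, a simple bigram chain, and OpenNLP's inline annotation format for the generated output; every name and line here is invented for illustration:

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.Random;

    // Learn a bigram chain from placeholder-substituted IRC lines, then emit
    // synthetic lines with an annotated entity name in place of the placeholder.
    public class PlaceholderChain {
        private static final String PLACEHOLDER = "__ENTITY__";
        private final Map<String, List<String>> successors = new HashMap<>();
        private final Random random = new Random();

        // Train on lines whose entity mentions were already replaced by PLACEHOLDER.
        public void learn(String line) {
            String[] tokens = line.trim().split("\\s+");
            for (int i = 0; i + 1 < tokens.length; i++) {
                successors.computeIfAbsent(tokens[i], k -> new ArrayList<>()).add(tokens[i + 1]);
            }
        }

        // Walk the chain from a start token, swapping in an entity at the placeholder.
        public String generate(String start, String entity, int maxTokens) {
            StringBuilder out = new StringBuilder();
            String current = start;
            for (int i = 0; i < maxTokens && current != null; i++) {
                String token = current.equals(PLACEHOLDER)
                        ? "<START:project> " + entity + " <END>"  // OpenNLP inline annotation
                        : current;
                out.append(token).append(' ');
                List<String> next = successors.get(current);
                current = (next == null || next.isEmpty())
                        ? null
                        : next.get(random.nextInt(next.size()));
            }
            return out.toString().trim();
        }

        public static void main(String[] args) {
            PlaceholderChain chain = new PlaceholderChain();
            chain.learn("anyone tried __ENTITY__ on the new cluster ?");
            chain.learn("the __ENTITY__ build is broken again");
            System.out.println(chain.generate("the", "Tomcat", 12));
        }
    }

Whether a model trained on such synthetic context generalizes to real channel traffic is exactly the caution raised earlier in the thread.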

Re: NER training quality

Posted by Daniel Russ <da...@gmail.com>.
Hi Joseph,
   I don’t remember exactly what features the NER uses, but a general rule of thumb is that you want the training data to resemble the unseen data. Think of the training data as a sampling experiment: the closer the sample gets to the population (the data not yet seen), the better the classifier will work.  You certainly can use the presence of a word in a dictionary as a feature, and that will probably help with the classification.  If you provide a little more about the problem, I could expand the answer a bit.
Daniel
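
For reference, OpenNLP's native name-finder training format marks entities inline, one sentence per line, so training data that "resembles the unseen data" would be real (or realistic) IRC lines annotated roughly like this; the "project" type and the lines themselves are invented examples:

    <START:project> Tomcat <END> keeps dropping connections after the upgrade
    has anyone wired <START:project> Kafka <END> into the ingest pipeline ?
    we moved the nightly docs build off <START:project> Jenkins <END> last week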


