Posted to dev@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2016/05/01 11:16:34 UTC

Surrounding tokens of the entity on MaxEnt models

Hello everybody,
How many surrounding tokens are taken into account to find an entity using
a maxent model?
Basically, a maxent model should detect an entity by looking at the
surrounding tokens, right?
I would like to understand:

1. Can I set the number of tokens on the left side?
2. Can I set the number of tokens on the right side too?

Thank you in advance for the clarification.
Best

Damiano

Re: Surrounding tokens of the entity on MaxEnt models

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
Of course you can use regex patterns, but it gets pretty complicated. See https://web.stanford.edu/class/cs124/lec/Maximum_Entropy_Classifiers.pdf, where Christopher Manning uses the example of a word that ends in “c” as a feature for the class drug. That could be a regex feature. You could also have a regex pattern over the word itself, but you need to be very specific, e.g. (\w+name\+:?). Keep in mind that if you are creating a feature generator, it works on one word (actually a token); you would have to play games to transform the (String[]) tokens back to a string.

Looking at your data, I would consider adding the features “ends with :” and “ends with ,”: it appears that the previous word often ends with a colon and the current word often ends with a comma. Of course, check that your tokenizer does not separate the punctuation from the word. You’ll have to see if it works or not.
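A self-contained sketch of those two features, written in the same shape as the createFeatures callback of OpenNLP's feature generators (the class and feature-name strings below are made up for illustration, not OpenNLP's own):

```java
import java.util.ArrayList;
import java.util.List;

public class PunctuationFeatures {
    // Emit the two punctuation features suggested above for the token
    // at `index`: current token ends with "," and previous ends with ":".
    static void createFeatures(List<String> features, String[] tokens, int index) {
        if (tokens[index].endsWith(",")) {
            features.add("cur_ends_comma");
        }
        if (index > 0 && tokens[index - 1].endsWith(":")) {
            features.add("prev_ends_colon");
        }
    }

    public static void main(String[] args) {
        String[] tokens = {"full", "name:", "Damiano,"};
        List<String> features = new ArrayList<>();
        createFeatures(features, tokens, 2);
        System.out.println(features); // prints [cur_ends_comma, prev_ends_colon]
    }
}
```

This only works if the tokenizer keeps the punctuation attached to the word, as Daniel notes.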

Hope it helps.
Daniel



Re: Surrounding tokens of the entity on MaxEnt models

Posted by Damiano Porta <da...@gmail.com>.
Hi Daniel! Thank you so much!

Unfortunately, I am not sure. I really do not know what the best way is in
this case.
I have a dataset with patterns like:

my name is {name}, from {location}
name: {name}
full name: {name}
I am {name}, i was born in {location}

etc etc etc

I could use regexes too, maybe a list of patterns that I can loop over for each
document. What do you think? I do not know if I can build a training set
with those examples (I have around 100 different patterns).
How can I create those features from my patterns?
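The "loop a list of patterns over each document" idea could look roughly like this; the two patterns below are placeholders standing in for the ~100 real ones, and the class and method names are made up for illustration:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PatternScan {
    // Placeholder patterns; each captures the name following a known phrase.
    static final List<Pattern> PATTERNS = Arrays.asList(
            Pattern.compile("my name is (\\p{Lu}\\w+)"),
            Pattern.compile("full name: (\\p{Lu}\\w+)")
    );

    // Run every pattern over the document and collect the captured names.
    static List<String> findNames(String document) {
        List<String> names = new ArrayList<>();
        for (Pattern p : PATTERNS) {
            Matcher m = p.matcher(document);
            while (m.find()) {
                names.add(m.group(1));
            }
        }
        return names;
    }

    public static void main(String[] args) {
        System.out.println(findNames("my name is Barack, from Hawaii"));
        // prints [Barack]
    }
}
```

A pure pattern loop like this finds only the phrasings you anticipated; the maxent model discussed in this thread is what generalizes beyond the listed patterns.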

Thank you in advance!



Re: Surrounding tokens of the entity on MaxEnt models

Posted by "Russ, Daniel (NIH/CIT) [E]" <dr...@mail.nih.gov>.
Hi Damiano,

     Why are you so sure that your model will not work?  A couple of things to remember: 1. You need quite a bit of training data; two sentences do not make a training set.  2. You probably need more than a window of words as your features.  However, you can see that word-2=“name” and word-1=“is” tend to precede a name.  Look into other potential features and get a larger dataset, and your results may surprise you.

Daniel



Re: Surrounding tokens of the entity on MaxEnt models

Posted by Jeffrey Zemerick <jz...@apache.org>.
I'm sure the others on this list can give you a more complete answer so I
will try to not lead you astray.

The WindowFeatureGenerator is only one of the available feature generators.
There are many classes that implement the AdaptiveFeatureGenerator
interface [1] and you can, of course, provide your own implementation of
that interface to support additional features. For example, the
SentenceFeatureGenerator [2] looks at the beginning and end of each
training sentence. So to answer your question, the length of the training
sentence should not matter; what matters is whether the combination of
configured feature generators can provide a model that accurately
describes the training text.

Jeff

[1]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/AdaptiveFeatureGenerator.html
[2]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/SentenceFeatureGenerator.html
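To make the "combination of configured feature generators" concrete, here is a self-contained sketch that mimics the shape of the AdaptiveFeatureGenerator contract and aggregates two toy generators; all class, field, and feature names here are ours for illustration, not OpenNLP's:

```java
import java.util.ArrayList;
import java.util.List;

public class CombinedFeaturesDemo {
    // Minimal stand-in for the AdaptiveFeatureGenerator contract:
    // each generator contributes string features for the token at `index`.
    interface FeatureGen {
        void createFeatures(List<String> features, String[] tokens, int index);
    }

    // Toy generator: the token itself.
    static final FeatureGen TOKEN = (f, t, i) -> f.add("w=" + t[i]);

    // Toy generator: sentence-boundary features, loosely in the spirit of
    // SentenceFeatureGenerator (marks the first/last token of the sentence).
    static final FeatureGen SENTENCE = (f, t, i) -> {
        if (i == 0) f.add("S[begin]");
        if (i == t.length - 1) f.add("S[end]");
    };

    // Aggregate: the model sees the union of all generators' features.
    static List<String> allFeatures(String[] tokens, int index, FeatureGen... gens) {
        List<String> features = new ArrayList<>();
        for (FeatureGen g : gens) {
            g.createFeatures(features, tokens, index);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"My", "name", "is", "Barack"};
        System.out.println(allFeatures(tokens, 0, TOKEN, SENTENCE));
        // prints [w=My, S[begin]]
    }
}
```

This is the sense in which sentence length does not matter by itself: each configured generator decides which properties of the context become features.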



Re: Surrounding tokens of the entity on MaxEnt models

Posted by Damiano Porta <da...@gmail.com>.
Hi Jeff!
Thank you so much for your fast reply.

I have a doubt. Let's suppose we use this feature generator with a window of:

2 tokens on the left + *ENTITY* + 2 tokens on the right

My doubt is: how can I train the model correctly?

If only the previous 2 tokens and the next 2 tokens matter, I should not
use long sentences to train the model. Right?

For example (person-model.train):

1. I am <START:person> Barack <END> and I am the president of USA

2. My name is <START:person> Barack <END> and my surname is Obama

...

These are two trivial training samples, just to illustrate my doubt.

In this case i should have:

*I am Barack and I*

*name is Barack and my*

The other tokens (left and right) do not matter. So the sentences in my
training set should be very short, right? Basically, I should only define
all the "combinations" of the previous/next 2 tokens, right?

Thank you!
Damiano




Re: Surrounding tokens of the entity on MaxEnt models

Posted by Jeffrey Zemerick <jz...@apache.org>.
I think you are looking for the WindowFeatureGenerator [1]. You can set the
size of the window by specifying the number of previous tokens and number
of next tokens.

Jeff

[1]
https://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/util/featuregen/WindowFeatureGenerator.html
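In pure Java, the window idea looks roughly like this; the method name and the feature-name format below are illustrative, not OpenNLP's actual output (the real WindowFeatureGenerator wraps another generator and prefixes its features by offset):

```java
import java.util.ArrayList;
import java.util.List;

public class WindowDemo {
    // For the token at `index`, emit the current token plus up to `prev`
    // tokens before it and `next` tokens after it, tagged by offset,
    // clipping the window at the sentence boundaries.
    static List<String> windowFeatures(String[] tokens, int index, int prev, int next) {
        List<String> features = new ArrayList<>();
        features.add("w=" + tokens[index]);
        for (int i = 1; i <= prev && index - i >= 0; i++) {
            features.add("w-" + i + "=" + tokens[index - i]);
        }
        for (int i = 1; i <= next && index + i < tokens.length; i++) {
            features.add("w+" + i + "=" + tokens[index + i]);
        }
        return features;
    }

    public static void main(String[] args) {
        String[] tokens = {"My", "name", "is", "Barack", "and", "my", "surname", "is", "Obama"};
        System.out.println(windowFeatures(tokens, 3, 2, 2));
        // prints [w=Barack, w-1=is, w-2=name, w+1=and, w+2=my]
    }
}
```

With prev = next = 2, the classifier for "Barack" above sees exactly the five-token window discussed later in this thread (name is Barack and my).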

