Posted to dev@opennlp.apache.org by Damiano Porta <da...@gmail.com> on 2016/08/12 14:21:36 UTC

Why are you using complete sentences to train a model?

Hello everyone,
pardon the stupid question, but I really do not get the point of
training a maxent model with complete sentences.

For example:

<START:person> Pierre Vinken <END> , 61 years old , will join the board as
a nonexecutive director Nov. 29 .

It has ~20 tokens.
As described here:
https://opennlp.apache.org/documentation/1.6.0/manual/opennlp.html#tools.namefind.training.featuregen
the default window should be 2 tokens on the left and 2 tokens on the right
of the entity. So what is the point of writing the entire sentence if there
are no other entities?

If I have understood it correctly, it should take into account
"Pierre Vinken" (as the entity name) and "," and "61" as the next 2 tokens. So why
do we need "*years old , will join the board as a nonexecutive*"?
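
For reference, the manual section linked above describes the default feature
generation as roughly the following combination (a Java sketch based on that
page; the exact defaults can differ between OpenNLP versions):

    import opennlp.tools.util.featuregen.*;

    // Default NameFinder features: a window of 2 tokens on each side of the
    // current token, plus a few sentence-level and history-based generators.
    AdaptiveFeatureGenerator featureGenerator = new CachedFeatureGenerator(
        new AdaptiveFeatureGenerator[] {
            new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2),
            new WindowFeatureGenerator(new TokenClassFeatureGenerator(true), 2, 2),
            new OutcomePriorFeatureGenerator(),
            new PreviousMapFeatureGenerator(),
            new BigramNameFeatureGenerator(),
            new SentenceFeatureGenerator(true, false)
        });

Note that the window is applied at every token position while the whole
sentence is decoded, not only around the annotated entities.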

Thank you in advance for the clarification!

Best
Damiano

Re: Why are you using complete sentences to train a model?

Posted by Damiano Porta <da...@gmail.com>.
Oh, I forgot one thing: does the order of the surrounding tokens matter? I
mean, if I train:

my name is PERSON

when it searches for the entity, will it match "name is" exactly, or is
"is name" treated the same way? (Or maybe I need to write the "negative"
version of it.)
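
One way to check is to print the features the window generator produces for a
toy sentence. A minimal sketch, assuming the WindowFeatureGenerator and
TokenFeatureGenerator mentioned in the manual; the generated feature strings
should carry position prefixes that distinguish previous from next tokens
(exact feature names may vary by version), which would mean the order of the
surrounding tokens does matter:

    import java.util.ArrayList;
    import java.util.List;
    import opennlp.tools.util.featuregen.TokenFeatureGenerator;
    import opennlp.tools.util.featuregen.WindowFeatureGenerator;

    public class InspectFeatures {
        public static void main(String[] args) {
            // 2 previous and 2 next tokens around the current position.
            WindowFeatureGenerator gen =
                new WindowFeatureGenerator(new TokenFeatureGenerator(), 2, 2);

            String[] tokens = {"my", "name", "is", "Damiano"};
            List<String> features = new ArrayList<>();

            // Generate features for the token at index 3 ("Damiano").
            gen.createFeatures(features, tokens, 3, null);

            // Prints one feature per line; previous-token features are
            // distinguished from next-token features by their prefix.
            features.forEach(System.out::println);
        }
    }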


Re: Why are you using complete sentences to train a model?

Posted by Damiano Porta <da...@gmail.com>.
Ok thank you so much guys!


Re: Why are you using complete sentences to train a model?

Posted by William Colen <wi...@gmail.com>.
You need to train with a corpus that is as close as possible to your
runtime corpus. If your runtime corpus is like that, I think it is OK.
Otherwise, the model can learn that entities occur too often, for example
that there is an entity in the middle of every window.
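
In practice that means giving the trainer whole annotated sentences, as they
would appear at runtime. A minimal training sketch along the lines of the 1.6
manual (the file name "en-ner-person.train" is a placeholder):

    import java.io.File;
    import java.nio.charset.StandardCharsets;
    import opennlp.tools.namefind.*;
    import opennlp.tools.util.*;

    public class TrainPersonFinder {
        public static void main(String[] args) throws Exception {
            // One annotated sentence per line, e.g.
            // <START:person> Pierre Vinken <END> , 61 years old , will join ...
            ObjectStream<String> lines = new PlainTextByLineStream(
                new MarkableFileInputStreamFactory(new File("en-ner-person.train")),
                StandardCharsets.UTF_8);
            ObjectStream<NameSample> samples = new NameSampleDataStream(lines);

            TokenNameFinderModel model = NameFinderME.train(
                "en", "person", samples,
                TrainingParameters.defaultParams(),
                new TokenNameFinderFactory());

            samples.close();
            // The model can then be written out with model.serialize(...).
        }
    }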



Re: Why are you using complete sentences to train a model?

Posted by Richard Eckart de Castilho <ri...@gmail.com>.
The non-entity tokens are not ignored; they serve as negative examples and as context.

A machine learning algorithm learns from positive and negative examples.

It also learns from context, e.g. it learns that a PERSON entity appears
in surroundings like "My name is PERSON ." or "I gave PERSON a present".

Without negative examples and without context, you cannot learn.
Then you could also simply look up words in a word list, e.g.
a list of names.
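
For comparison, the pure word-list approach mentioned above is roughly what
OpenNLP's DictionaryNameFinder does: it matches listed names by exact lookup,
with no context and no negative examples involved. A minimal sketch with a
hypothetical two-name dictionary:

    import opennlp.tools.dictionary.Dictionary;
    import opennlp.tools.namefind.DictionaryNameFinder;
    import opennlp.tools.util.Span;
    import opennlp.tools.util.StringList;

    public class DictionaryLookup {
        public static void main(String[] args) {
            Dictionary names = new Dictionary();
            names.put(new StringList("Pierre", "Vinken"));
            names.put(new StringList("Damiano", "Porta"));

            DictionaryNameFinder finder = new DictionaryNameFinder(names, "person");

            String[] tokens = {"Pierre", "Vinken", ",", "61", "years", "old",
                               ",", "will", "join", "the", "board", "."};

            // Finds "Pierre Vinken" purely by lookup; the surrounding tokens
            // play no role, which is exactly the limitation being described.
            for (Span span : finder.find(tokens)) {
                System.out.println(span);
            }
        }
    }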

Cheers,

-- Richard


Re: Why are you using complete sentences to train a model?

Posted by Damiano Porta <da...@gmail.com>.
OK, but why not just ignore all the other tokens? I mean, when I write
2 TOKENS + ENTITY + 2 TOKENS, I am interested in finding the entity with these
surrounding tokens, so that should mean the other "cases" can be ignored. No?

Why do I need to write all the other cases when those must be ignored?


Re: Why are you using complete sentences to train a model?

Posted by William Colen <wi...@gmail.com>.
You also need examples of what is not an entity.

