Posted to users@opennlp.apache.org by Khurram <kh...@gmail.com> on 2011/01/22 07:22:11 UTC

train SentenceDetectorME

How can I train SentenceDetectorME so that it does not treat dates written
like mm.dd.yyyy. as the end of a sentence? I tried adding a few examples to
sentences.txt and re-running the test, but it always seems to treat the dots
as the end of a sentence.

Is there a minimum number of times the model has to see the pattern as a word
within a sentence during training before it learns that the dots do not mark
the end of a sentence?

Khurram.

Re: train SentenceDetectorME

Posted by Khurram <kh...@gmail.com>.

The language is English, but I am trying to understand the learning model so
that I can make it work in different scenarios. A decimal point as in 3.14, a
birthdate, or an acronym should not signal the end of a sentence. I am
probably looking for something like the Regex Name Finder.
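
To make the regex idea concrete, here is a rough sketch in plain Java. It is
not an OpenNLP API; the class name DotFilter and both method names are made up
for illustration. It only assumes that a dot with a digit on each side (as in
3.14 or 01.22.2011) is never a sentence boundary, and it leaves every other
dot, including the trailing dot of mm.dd.yyyy., as a candidate for the model
to decide.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of OpenNLP.
class DotFilter {

    // A dot with a digit directly before and after it, e.g. 3.14 or 01.22.2011.
    private static final Pattern DIGIT_DOT_DIGIT =
            Pattern.compile("(?<=\\d)\\.(?=\\d)");

    // Offsets of dots that sit between two digits.
    static List<Integer> protectedDots(String text) {
        List<Integer> offsets = new ArrayList<>();
        Matcher m = DIGIT_DOT_DIGIT.matcher(text);
        while (m.find()) {
            offsets.add(m.start());
        }
        return offsets;
    }

    // Offsets of all remaining dots; these are still possible sentence ends.
    static List<Integer> candidateBoundaries(String text) {
        Set<Integer> skip = new HashSet<>(protectedDots(text));
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '.' && !skip.contains(i)) {
                result.add(i);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String text = "Pi is about 3.14. The file is dated 01.22.2011. and unsigned.";
        System.out.println("protected:  " + protectedDots(text));
        System.out.println("candidates: " + candidateBoundaries(text));
    }
}
```

A filter like this cannot settle the dot after the year, and it does nothing
for acronyms, so it would complement a trained model rather than replace it.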

thanks,
Khurram

Re: train SentenceDetectorME

Posted by Jörn Kottmann <ko...@gmail.com>.

On 1/22/11 7:22 AM, Khurram wrote:
> How can I train SentenceDetectorME so that it does not treat dates written
> like mm.dd.yyyy. as the end of a sentence? I tried adding a few examples to
> sentences.txt and re-running the test, but it always seems to treat the dots
> as the end of a sentence.

To help you better, we need to know which language you want to train the
sentence detector for.

To get good results you should train it with a few thousand sentences. The
few lines in our regression test data are not enough to produce a usable
model.

> Is there a minimum number of times the model has to see the pattern as a word
> within a sentence during training before it learns that the dots do not mark
> the end of a sentence?

There is a cutoff with a default of 5, so every feature that should be part
of the model must be seen in the training data at least as often as the
cutoff value.
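
To illustrate what such a cutoff does, here is a minimal sketch in plain Java.
It is not the actual OpenNLP implementation; the class name CutoffSketch and
the feature strings are invented for the example. It simply counts how often
each feature occurs across the training events and keeps those that reach the
cutoff.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Illustrative only; the real trainer is more involved.
class CutoffSketch {

    // Returns the features that occur at least `cutoff` times in all events.
    static Set<String> keptFeatures(List<List<String>> events, int cutoff) {
        Map<String, Integer> counts = new HashMap<>();
        for (List<String> event : events) {
            for (String feature : event) {
                counts.merge(feature, 1, Integer::sum);
            }
        }
        Set<String> kept = new TreeSet<>();
        for (Map.Entry<String, Integer> entry : counts.entrySet()) {
            if (entry.getValue() >= cutoff) {
                kept.add(entry.getKey());
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<List<String>> events = new ArrayList<>();
        // Hypothetical feature seen five times: reaches the default cutoff.
        for (int i = 0; i < 5; i++) {
            events.add(Arrays.asList("prev=four-digits"));
        }
        // Hypothetical feature seen only twice: dropped at cutoff 5.
        for (int i = 0; i < 2; i++) {
            events.add(Arrays.asList("prev=date-like"));
        }
        System.out.println(keptFeatures(events, 5));
    }
}
```

This is why a handful of extra lines in sentences.txt has no visible effect:
features that stay below the cutoff never make it into the model.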

Depending on your language we might be able to point you to training data.

There is also a bit of documentation about the sentence detector in our
opennlp-docs project. If you think something is missing there, we would
really appreciate a patch for it.

Jörn