Posted to users@opennlp.apache.org by Markus Kreuzthaler <ma...@gmail.com> on 2017/09/29 13:52:51 UTC

custom eos characters

Hello!

I am restating my problem, as I think it is closely related to the following
issue:
https://issues.apache.org/jira/browse/OPENNLP-602

I work with clinical narratives, where eos (end-of-sentence) characters are
very often simply missing, and I am trying to train a new, robust sentence
model. The issue above suggests encoding these kinds of sentence endings with
<CR><LF> or just <LF>.

How do I set this up properly?

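import opennlp.tools.sentdetect.SentenceDetectorFactory;

char[] eosCharacters = {'!', '?', '.'};
// Arguments: language code, useTokenEnd, abbreviation dictionary (none), eos characters
SentenceDetectorFactory sentenceFactory =
    new SentenceDetectorFactory("de", true, null, eosCharacters);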

eosCharacters is a char array, so how do I put your suggested encodings
'<CR><LF>' and '<LF>' into it?
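
One way to read this question: in Java, <CR> and <LF> are the single characters '\r' and '\n', so each fits into one char[] element, whereas the pair <CR><LF> is a two-character sequence and cannot be a single element. A minimal sketch using only the JDK; the class and method names are made up for illustration, and whether the sentence detector then treats line breaks sensibly is a separate question that depends on the training data:

```java
import java.util.Arrays;

// Hypothetical helper, not part of OpenNLP.
class EosCharacters {

    // Returns a copy of the given end-of-sentence characters with
    // carriage return and line feed appended as two separate elements.
    static char[] withLineBreaks(char[] base) {
        char[] result = Arrays.copyOf(base, base.length + 2);
        result[base.length] = '\r';     // <CR>, U+000D
        result[base.length + 1] = '\n'; // <LF>, U+000A
        return result;
    }

    public static void main(String[] args) {
        char[] eos = withLineBreaks(new char[] {'!', '?', '.'});
        System.out.println(eos.length); // prints 5
    }
}
```

The resulting array could be passed as the eosCharacters argument shown above.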

How should I then prepare my final training data set?
For example, the text contains something like this (with an artificial line
break in the middle of the sentence):
The quick abbr. brown
fox jumps over the lazy dog

Training:
The quick abbr. brown fox jumps over the lazy dog <CR><LF>

If the standard eos characters {'.','?','!'} are present:
The quick abbr. brown
fox jumps over the lazy dog.

Training:
The quick abbr. brown fox jumps over the lazy dog.

If I have an abbreviation at the end of a sentence, do I have to encode this
in a special way?
The quick abbr. brown
fox jumps over the lazy dog abbr.

Training:
The quick abbr. brown fox jumps over the lazy dog abbr.
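
The "Training:" lines above follow the one-sentence-per-line layout used by the OpenNLP sentence detector training format. A rough sketch of that joining step with plain JDK string handling; TrainingLines and toTrainingLine are hypothetical names, and the sketch assumes the sentence boundaries are already known from manual annotation:

```java
// Hypothetical helper, not part of OpenNLP.
class TrainingLines {

    // Collapses every run of whitespace, including <CR><LF> and <LF>,
    // into a single space so that one annotated sentence ends up on one line.
    static String toTrainingLine(String rawSentence) {
        return rawSentence.trim().replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String raw = "The quick abbr. brown\r\nfox jumps over the lazy dog abbr.";
        System.out.println(toTrainingLine(raw));
    }
}
```

This only removes the artificial line breaks; it does not add any marker for sentences that end without punctuation.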

When I have trained my model, do I have to adapt the input text so that it
contains e.g. <CR><LF> or <LF> in the same way as the training sentences?

Thank you for your help!

Best regards, Markus

Re: custom eos characters

Posted by Markus Kreuzthaler <ma...@gmail.com>.

Dear Dan and Jörn!

Thank you for your reply!
I am still trying to find the right training format.

If I understand Jörn correctly, it would be:

char[] eosCharacters = {'!', '?', '.', '\n'};
SentenceDetectorFactory sentenceFactory =
    new SentenceDetectorFactory("de", true, null, eosCharacters);

From the text (including an artificial line break after "brown"):
The quick abbr. brown
fox jumps over the lazy dog

Training:
A) The quick abbr. brown fox jumps over the lazy dog <NEW_LINE>
Or
B) The quick abbr. brown <NEW_LINE> fox jumps over the lazy dog <NEW_LINE>

What is the right format after the update, A or B?
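
To make the difference between the two candidates concrete, here is a small sketch that renders a wrapped sentence both ways. It is purely illustrative: the <NEW_LINE> tag is only a proposal in this thread, neither variant is confirmed as an OpenNLP training format, and the class and method names are invented.

```java
// Hypothetical illustration of the two candidate layouts from this message.
class NewLineMarkerVariants {

    static final String MARKER = "<NEW_LINE>";

    // Variant A: join the wrapped lines and mark only the sentence end.
    static String variantA(String rawSentence) {
        return rawSentence.trim().replaceAll("\\s+", " ") + " " + MARKER;
    }

    // Variant B: keep every line break as a marker, including the final one.
    static String variantB(String rawSentence) {
        StringBuilder sb = new StringBuilder();
        // \R matches any line break sequence (\r\n, \n, \r, ...).
        for (String line : rawSentence.trim().split("\\R")) {
            sb.append(line.trim().replaceAll("\\s+", " "))
              .append(' ').append(MARKER).append(' ');
        }
        return sb.toString().trim();
    }

    public static void main(String[] args) {
        String raw = "The quick abbr. brown\nfox jumps over the lazy dog";
        System.out.println("A) " + variantA(raw));
        System.out.println("B) " + variantB(raw));
    }
}
```

Variant B keeps the information that a line break occurred inside the sentence, while variant A discards it.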

Best regards, Markus


2017-09-29 18:56 GMT+02:00 Dan Russ <da...@gmail.com>:

> I am not suggesting we actually change anything.  Only that it is more
> complicated than adding chars to the eos array.
>
> Daniel

Re: custom eos characters

Posted by Dan Russ <da...@gmail.com>.

I am not suggesting we actually change anything.  Only that it is more complicated than adding chars to the eos array.

Daniel


> On Sep 29, 2017, at 10:44 AM, Joern Kottmann <ko...@gmail.com> wrote:
> 
> I think it is a bit unfortunate that we have two tags, <LF> and <CR>. I
> would change this and normalize them into just one tag, e.g. <NEW_LINE>,
> and then allow it to be placed in our existing training format as an
> end-of-sentence marker.
> 
> The eos array also needs to contain that char; we can just take \n and
> use it as a marker that we need to detect new line chars independent
> of the platform.
> 
> As a reminder, other components have the same problem: the name finder,
> for example, can't take new lines into account, although this is clearly
> needed for certain data sets, such as a name list where each name is
> written on its own line.
> 
> Jörn

Re: custom eos characters

Posted by Joern Kottmann <ko...@gmail.com>.

I think it is a bit unfortunate that we have two tags, <LF> and <CR>. I
would change this and normalize them into just one tag, e.g. <NEW_LINE>,
and then allow it to be placed in our existing training format as an
end-of-sentence marker.

The eos array also needs to contain that char; we can just take \n and
use it as a marker that we need to detect new line chars independent
of the platform.
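
The platform-independent part of this idea can be sketched as a preprocessing step that maps all common line endings to a single \n before training or detection. This is an assumption about how one might do it in user code, not a description of what OpenNLP does internally; NewlineNormalizer is a made-up name.

```java
// Hypothetical preprocessing helper, not part of OpenNLP.
class NewlineNormalizer {

    // Maps Windows (\r\n) and lone carriage return (\r) line endings to \n.
    // The two-character sequence is replaced first so it yields one \n, not two.
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    public static void main(String[] args) {
        String mixed = "line one\r\nline two\rline three\n";
        System.out.println(normalize(mixed).equals("line one\nline two\nline three\n")); // prints true
    }
}
```

With input normalized like this, a single '\n' entry in the eos array would cover all three conventions.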

As a reminder, other components have the same problem: the name finder,
for example, can't take new lines into account, although this is clearly
needed for certain data sets, such as a name list where each name is
written on its own line.

Jörn

On Fri, Sep 29, 2017 at 4:32 PM, Dan Russ <da...@gmail.com> wrote:
> Hi Markus,
>    Just adding the characters <CR> and <LF> to the eos array is not going to solve your problem.  You would also need to add <CR> and <LF> to your training set; otherwise the sentence detector will ALWAYS end the sentence at <CR><LF>.  Think about what the training data looks like (including the example you gave).  I think this would require OpenNLP to change the format of the sentence detector training data, so that the detector could see <CR> and <LF>, read the next word, and decide whether it is an end of sentence.  You would want data like:
>
> Patient admitted at 8:00 AM <LF><CR> <End:Sentence> He complained of stomach cramps   <LF><CR><End:Sentence>
>
> In order to catch the end-of-line as a sentence delimiter.
>
> Do you see a way around it?  Comments?
> Daniel

Re: custom eos characters

Posted by Dan Russ <da...@gmail.com>.

Hi Markus,
   Just adding the characters <CR> and <LF> to the eos array is not going to solve your problem.  You would also need to add <CR> and <LF> to your training set; otherwise the sentence detector will ALWAYS end the sentence at <CR><LF>.  Think about what the training data looks like (including the example you gave).  I think this would require OpenNLP to change the format of the sentence detector training data, so that the detector could see <CR> and <LF>, read the next word, and decide whether it is an end of sentence.  You would want data like:

Patient admitted at 8:00 AM <LF><CR> <End:Sentence> He complained of stomach cramps   <LF><CR><End:Sentence>

In order to catch the end-of-line as a sentence delimiter.
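
The concern about ALWAYS ending a sentence at a line break can be illustrated with a deliberately naive splitter that has no model and no context. This is not how the OpenNLP sentence detector works; it is a hypothetical stand-in (NaiveLineBreakSplitter is an invented name) showing what unconditional splitting does to the wrapped sentence from the original post.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only, not part of OpenNLP.
class NaiveLineBreakSplitter {

    // Treats every line break as a sentence end, without looking at context.
    static List<String> split(String text) {
        List<String> sentences = new ArrayList<>();
        // \R matches any line break sequence (\r\n, \n, \r, ...).
        for (String part : text.split("\\R")) {
            String candidate = part.trim();
            if (!candidate.isEmpty()) {
                sentences.add(candidate);
            }
        }
        return sentences;
    }

    public static void main(String[] args) {
        String text = "The quick abbr. brown\r\n"
                + "fox jumps over the lazy dog\r\n"
                + "Patient admitted at 8:00 AM";
        // The first sentence is wrongly cut in two at the artificial line break.
        split(text).forEach(System.out::println);
    }
}
```

A trained model would need examples of both cases, line breaks that end a sentence and line breaks that do not, to avoid the wrong cut after "brown".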

Do you see a way around it?  Comments?
Daniel
