Posted to users@opennlp.apache.org by Tim Miller <ti...@childrens.harvard.edu> on 2013/06/06 15:48:45 UTC

Re: sentence detector newline behavior

Hi opennlp,
I started a thread on ctakes-dev about training the sentence detector to 
allow newlines in the middle of sentences. Jörn said it was possible, and 
now I have a question about how to proceed.

I've replaced all newlines with a special character (ß) and built a 
small training file of sentences. I have a question about the OpenNLP 
training file format that I couldn't find in the documentation. At the 
end of a section, there might be a period, multiple newlines, and some 
miscellaneous whitespace:

    This concludes the recording of family history.ß    ß ßHISTORY OF
    PRESENT ILLNESS:

Now, for downstream processing we probably want one sentence ending with 
"...history." and the next beginning with "HISTORY...".
But what does that mean for the file format? Should it include all the 
newlines and other whitespace between the period (which "officially" 
ends the sentence) and the start of the next sentence? If so, does it go 
at the end of the first line or the beginning of the second? Does the 
algorithm even use this info?
Sorry about the barrage of questions, and thanks for your help with this. 
It's already coming along nicely, but I just want to make sure I'm doing it 
optimally.
Tim

On 05/23/2013 01:52 PM, Tim Miller wrote:
> OK I've started doing this, was able to get training working on a very 
> small example, will try doing slightly bigger.
> Tim
>
> On 05/22/2013 08:03 AM, Jörn Kottmann wrote:
>> On 05/22/2013 01:17 PM, Miller, Timothy wrote:
>>> That's awesome! It might be worth trying at least. How does the 
>>> training
>>> process change? Previously the training data would be one sentence per
>>> line, but with newlines as possible mid-sentence characters that could
>>> be trouble, is there a new representation for training data? Or 
>>> would we
>>> have to use the training api?
>>
>> Good point, yes that will be a problem with the default training 
>> format, but it shouldn't be hard
>> to solve. In the format itself we could define a new line tag e.g. 
>> <NEWLINE> to mark new lines.
>> As a hack to make it work with 1.5.3, you could instead use a special 
>> char as a replacement
>> for the new line char.
>> When you pass the text down to the sentence detector a simple string 
>> replace could be used to
>> convert all new line chars to the special new line marker char.
>>
>> If things work out for you performance-wise as well, we will just 
>> integrate it properly into OpenNLP
>> for the next release.
>>
>> Could you produce a sentence detector training file with a new line 
>> marker char?
>>
>> You should try to pick a char you can also pass in on a terminal; 
>> otherwise you have to use the
>> API to train the model. The built-in cross validation could be used 
>> to evaluate the performance.
>>
>> Jörn
>


Re: sentence detector newline behavior

Posted by Jörn Kottmann <ko...@gmail.com>.
On 06/06/2013 03:48 PM, Tim Miller wrote:
> Hi opennlp,
> I started a thread on ctakes-dev about training the sentence detector 
> to allow newlines in the middle of sentences. Jörn said it was 
> possible, and now I have a question about how to proceed.
>
> I've replaced all newlines with a special character (ß) and built a 
> small training file of sentences. I have a question about the OpenNLP 
> training file format that I couldn't find in the documentation. At the 
> end of a section, there might be a period, multiple newlines, and some 
> miscellaneous whitespace:
>
>    This concludes the recording of family history.ß    ß ßHISTORY OF
>    PRESENT ILLNESS:
>
> Now, for downstream processing we probably want one sentence ending 
> with "...history." and the next beginning with "HISTORY...".
> But what does that mean for the file format? Should it include all the 
> newlines and other whitespace between the period (which "officially" 
> ends the sentence) and the start of the next sentence? If so, does it 
> go at the end of the first line or the beginning of the second? Does 
> the algorithm even use this info?
> Sorry about the barrage of questions, and thanks for your help with 
> this. It's already coming along nicely, but I just want to make sure 
> I'm doing it optimally.

I had a look at the code, and as far as I can see, whitespace is 
simply assumed to be between
two sentences if you train with a training file. If you use the API 
directly, that assumption is not made, so in case
you have some UIMA CASes with sentence markup, it's probably easier to 
train directly without using the training format.

The new line chars (ß in your case) should be there, and you probably 
want them to be placed consistently. In the current implementation a training
event will be generated for each of them, so for a case like yours 
I would suggest attaching them to the front of
the next line rather than putting them at the end of the previous line. 
You might still want to experiment a bit to see what gives the best
results.
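
As an illustration of that layout, here is a minimal Java sketch that
writes the marker run at the front of the following training line. It
does not use any OpenNLP classes; the class name TrainingLines, the
method names, and the choice to drop only leading plain whitespace are
assumptions made for this example, and the sentence end offsets are
assumed to come from your own annotations.

```java
import java.util.ArrayList;
import java.util.List;

public class TrainingLines {
    // Marker taken from the thread: ß (U+00DF) stands in for a line break.
    static final char MARKER = '\u00DF';

    /** Replaces each line break with MARKER; "\r\n" counts as one break. */
    static String encode(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n').replace('\n', MARKER);
    }

    /**
     * Builds one training line per sentence. sentenceEnds holds the exclusive
     * end offset of each sentence in the encoded text, in ascending order.
     * Whatever sits between two sentence ends (markers, spaces) is kept at the
     * front of the following line; only leading plain whitespace is dropped.
     */
    static List<String> toTrainingLines(String encoded, int[] sentenceEnds) {
        List<String> lines = new ArrayList<>();
        int prevEnd = 0;
        for (int end : sentenceEnds) {
            String line = encoded.substring(prevEnd, end).replaceFirst("^\\s+", "");
            lines.add(line);
            prevEnd = end;
        }
        return lines;
    }
}
```

Each returned string would then be written as one line of the training
file. Whether the spaces between the markers help or hurt is something
to check with cross validation.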

The current sentence detector has an optimization that removes all 
new lines and whitespace around a detected sentence.
With your modification that will not work: the new line markers will be 
included in the returned sentence span.
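
If the extra markers in the spans are a problem downstream, the trimming
can be done on the caller side. The following Java sketch shrinks a
[start, end) span so it excludes surrounding whitespace and markers, and
converts the markers back to line breaks. SpanTrimmer and its methods
are hypothetical names, not part of OpenNLP, and the sketch assumes that
ß never occurs as a real letter in the input text.

```java
public class SpanTrimmer {
    // Marker taken from the thread: ß (U+00DF) stands in for a line break.
    static final char MARKER = '\u00DF';

    /** True for characters that should not start or end a sentence span. */
    static boolean isPadding(char c) {
        return c == MARKER || Character.isWhitespace(c);
    }

    /** Returns {start, end} shrunk so the span has no padding at either edge. */
    static int[] trim(CharSequence text, int start, int end) {
        while (start < end && isPadding(text.charAt(start))) {
            start++;
        }
        while (end > start && isPadding(text.charAt(end - 1))) {
            end--;
        }
        return new int[] {start, end};
    }

    /** Turns markers back into line breaks for downstream components. */
    static String decode(String encoded) {
        return encoded.replace(MARKER, '\n');
    }
}
```

As long as each line break is replaced by exactly one marker character,
the trimmed offsets also apply to the original text unchanged.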

If you want to have a look at the feature generation, it's defined here:
opennlp.tools.sentdetect.DefaultSDContextGenerator

You can also implement your own feature generation or copy and hack the 
default one. For this you need to define a
custom factory which sets the sentence detector up during instantiation. 
I can provide you with a sample if you want to try it.
During training you can just hand in the class name of the factory, and 
it will then be used automatically, both for training and later
during model instantiation.

Don't hesitate to ask further questions.

HTH,
Jörn