Posted to users@opennlp.apache.org by Vihari Piratla <vi...@gmail.com> on 2016/01/08 05:10:07 UTC

Information on default sentence model

Hello OpenNLP community,
I am a long-time OpenNLP user and rely on several of the NLP tools it
provides in my application.
I have a few basic questions about the SentenceDetector and the default
sentence model.

Since sentence detection is a basic first step in many data-processing
pipelines, I am trying to make it more robust.
The SentenceDetectorFactory class accepts an abbreviation dictionary as a
parameter, which I think is very useful.
I checked the default models available from
http://opennlp.sourceforge.net/models-1.5/. The sentence model available
there does not seem to use any abbreviations: calling getAbbreviations on
the loaded model returns null.
If no abbreviation dictionary was used during training, the model will be
agnostic to features such as "sabbrev", "vabbrev", "xabbrev", which
DefaultSDContextGenerator.collectFeatures generates from the dictionary. In
that case, I am not sure whether supplying a list of abbreviations through
SentenceDetectorFactory at evaluation time will make any difference.
Am I missing something? Apologies if I am wrong.
If I am right, I would appreciate suggestions for alternatives.
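
To make the concern concrete, here is a toy sketch in Java. It is not
OpenNLP code: the class name, the weights, and the feature names other than
"sabbrev" are made up for illustration, and it assumes the classifier
simply ignores any feature it has no learned parameter for.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class UnseenFeatureSketch {

    // Toy linear scorer: adds up the weights of the active features.
    // Assumption: a feature with no learned weight contributes 0.0.
    static double score(Map<String, Double> weights, List<String> features) {
        double sum = 0.0;
        for (String feature : features) {
            sum += weights.getOrDefault(feature, 0.0);
        }
        return sum;
    }

    public static void main(String[] args) {
        // Made-up weights for a model trained WITHOUT an abbreviation
        // dictionary, so it has no entry for "sabbrev".
        Map<String, Double> weights = new HashMap<>();
        weights.put("eos=.", 1.2);   // hypothetical feature name
        weights.put("x=Mr", -0.8);   // hypothetical feature name

        // Same context, with and without the dictionary-based feature.
        List<String> withoutDict = Arrays.asList("eos=.", "x=Mr");
        List<String> withDict = Arrays.asList("eos=.", "x=Mr", "sabbrev");

        double a = score(weights, withoutDict);
        double b = score(weights, withDict);
        System.out.println("score unchanged by sabbrev: " + (a == b));
    }
}
```

If the real model behaves like this, the "sabbrev" feature fires at
evaluation time but cannot move the decision, which is why I suspect the
model would have to be retrained with the dictionary.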

Also, I am not sure about the purpose of the useTokenEnd parameter of
SentenceDetectorFactory. Can someone explain it or point me to a resource
that does?

I am not sure whether the users list is the right place for this post. If
not, please let me know and I will move it to dev.

Thanks
-- 
Vihari Piratla