Posted to users@opennlp.apache.org by Jeff Zemerick <jz...@apache.org> on 2017/09/06 17:49:45 UTC

Re: How do abbreviations work when training a sentence detector

Ade,

The abbreviations provided in the dictionary when training the model are
used to generate features from the training text. When the trainer finds an
end-of-sentence character in the training text, it checks whether the text
immediately preceding that character is one of the provided abbreviations.
If it is, a feature is generated. The trained model will then be better at
distinguishing abbreviations from actual ends of sentences in input text.
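
To make that concrete, here is a rough sketch in plain Java with no OpenNLP
dependency. It is not OpenNLP's actual feature generator. The class name,
the event format, and the whitespace-based token lookup are all assumptions
made for illustration. It treats one line as one training sentence, labels
every candidate character as EOS or notEOS, and records whether the token
ending at that character is in the abbreviation set.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; names and output format are hypothetical,
// not part of the OpenNLP API.
class AbbreviationFeatureSketch {

    // Characters treated as possible sentence ends (assumed set).
    static final String EOS_CHARS = ".?!";

    // Treats `line` as exactly one training sentence and returns one
    // labeled event per candidate end-of-sentence character.
    static List<String> events(String line, Set<String> abbreviations) {
        List<String> out = new ArrayList<>();
        String text = line.trim();
        for (int i = 0; i < text.length(); i++) {
            if (EOS_CHARS.indexOf(text.charAt(i)) < 0) {
                continue; // not a candidate character
            }
            // Walk back to the previous whitespace to find the token
            // that ends at this candidate character.
            int start = i;
            while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
                start--;
            }
            String token = text.substring(start, i + 1);
            boolean isAbbrev = abbreviations.contains(token);
            // Only the final character of the line is a true sentence end.
            String label = (i == text.length() - 1) ? "EOS" : "notEOS";
            out.add(label + " token=" + token + " abbrev=" + isAbbrev);
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> abbreviations = Set.of("Mr.", "Mrs.", "Ms.", "ext.");
        String line = "I called 464-6859 ext. 13 and asked for Mr. Frank.";
        // Prints:
        //   notEOS token=ext. abbrev=true
        //   notEOS token=Mr. abbrev=true
        //   EOS token=Frank. abbrev=false
        for (String event : events(line, abbreviations)) {
            System.out.println(event);
        }
    }
}
```

In this sketch the dictionary only contributes the abbrev flag. The notEOS
labels still have to come from training sentences that contain the
abbreviation mid-line, which would be consistent with needing both the
dictionary entry and examples in the training data.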

Jeff



On Wed, Sep 6, 2017 at 12:59 PM, Ade Miller <ad...@getconga.com> wrote:

>
> I train the model on a sample stream with many sentences, one per line.
> The single sentence is just a trivial test example to see if
> abbreviations work.
>
> model = trainer.train(language, sampleStream, fact, trainingParameters);
>
> It seems like I have to define an abbreviation in the dictionary and
> include examples of it in the training data for this to work. In which
> case I'm not clear what the abbreviations dictionary actually does.
>
> -----Original Message-----
> From: Daniel Russ [mailto:druss@apache.org]
> Sent: Wednesday, September 6, 2017 9:51 AM
> To: users@opennlp.apache.org
> Subject: Re: How do abbreviations work when training a sentence detector
>
> You are trying to train a sentence detector with only 1 sentence. Each
> line should be 1 sentence; the final character in the line marks the EOS.
> It should handle abbreviations correctly. The idea behind the S.D. is that
> every period (or ? or !) is classified as EOS or notEOS.
> Daniel
>
> Please see: http://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.sentdetect
> for more info.
>
>
> > On Sep 6, 2017, at 12:21 PM, Ade Miller <ad...@getconga.com> wrote:
> >
> > I'm trying to train a sentence detector with a set of abbreviations but
> > am not seeing the behavior I expected.
> >
> >        InputStreamFactory factory = new MarkableFileInputStreamFactory(trainingData);
> >        PlainTextByLineStream lineStream = new PlainTextByLineStream(factory, Constants.CHARSET);
> >        ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);
> >
> >        Dictionary abbreviations = new AbbreviationsResourceLoader().load();
> >        SentenceDetectorFactory fact = new SentenceDetectorFactory(language, true, abbreviations, null);
> >        model = trainer.train(language, sampleStream, fact, trainingParameters);
> >
> >        CustomSentenceDetectorME detect = new CustomSentenceDetectorME(model);
> >        String[] sentences = detect.sentDetect("The cat, Ms. Furry, sat on the mat. "
> >                + "I called 464-6859 ext. 13 and asked for Mr. Frank. "
> >                + "The dog, well, it lay in Mrs. Smythe's yard.");
> >        for (String s : sentences) {
> >            LOG.info(s);
> >        }
> >
> > The output I get shows that sentences are being split on the
> > abbreviations:
> >
> > The cat, Ms.
> > , sat on the mat.
> > I called 464-6859 ext.
> > 13 and asked for Mr.
> > Frank.
> > The dog, well, it lay in Mrs.
> > Smythe's yard.
> >
> > How is the abbreviation dictionary used? Does the training set also have
> > to include examples of the same abbreviation(s)?
> >
> > Thanks,
> >
> > Ade
>
>