Posted to users@opennlp.apache.org by Ade Miller <ad...@getconga.com> on 2017/09/06 16:21:59 UTC

How do abbreviations work when training a sentence detector

I'm trying to train a sentence detector with a set of abbreviations but am not seeing the behavior I expected.

        InputStreamFactory factory = new MarkableFileInputStreamFactory(trainingData);
        PlainTextByLineStream lineStream = new PlainTextByLineStream(factory, Constants.CHARSET);
        ObjectStream<SentenceSample> sampleStream = new SentenceSampleStream(lineStream);

        Dictionary abbreviations = new AbbreviationsResourceLoader().load();
        SentenceDetectorFactory fact = new SentenceDetectorFactory(language, true, abbreviations, null);
        model = trainer.train(language, sampleStream, fact, trainingParameters);

        CustomSentenceDetectorME detect = new CustomSentenceDetectorME(model);
        String[] sentences = detect.sentDetect("The cat, Ms. Furry, sat on the mat. I called 464-6859 ext. 13 and asked for Mr. Frank. The dog, well, it lay in Mrs. Smythe's yard.");
        for (String s : sentences) {
            LOG.info(s);
        }

The output I get shows that sentences are being split on the abbreviations:

The cat, Ms.
, sat on the mat.
I called 464-6859 ext.
13 and asked for Mr.
Frank.
The dog, well, it lay in Mrs.
Smythe's yard.

How is the abbreviation dictionary used? Does the training set also have to include examples of the same abbreviation(s)?

Thanks,

Ade

Re: How do abbreviations work when training a sentence detector

Posted by Jeff Zemerick <jz...@apache.org>.
Ade,

The abbreviations provided in the dictionary when training the model are
used to determine features of the training text. When an end-of-sentence
character is found in the training text, the trainer checks whether the
text immediately preceding the character is one of the provided
abbreviations. If it is, a feature is generated. The trained model will
then be better at differentiating between abbreviations and actual ends of
sentences in input text.

Jeff
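
The lookup described above can be sketched in plain Java. This is only an illustration of the idea, not OpenNLP's implementation: the class name, the "abbrev"/"no-abbrev" labels, the choice of ".", "?" and "!" as candidate characters, and the assumption that dictionary entries include the trailing period are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: flags candidate sentence ends that are preceded by a
// known abbreviation. It does not call OpenNLP.
class AbbreviationFeatureSketch {

    // Characters treated as possible sentence ends (an assumption of this sketch).
    private static final String EOS_CHARS = ".?!";

    /** Returns one entry per candidate: the token ending there and an abbreviation flag. */
    static List<String> describeCandidates(String text, Set<String> abbreviations) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (EOS_CHARS.indexOf(text.charAt(i)) < 0) {
                continue; // not a candidate sentence end
            }
            // Walk back to the previous whitespace to find the token that ends here.
            int start = i;
            while (start > 0 && !Character.isWhitespace(text.charAt(start - 1))) {
                start--;
            }
            String token = text.substring(start, i + 1);
            // The flag is only a feature; a trained model still decides EOS vs. not.
            boolean isAbbreviation = abbreviations.contains(token);
            out.add(token + " -> " + (isAbbreviation ? "abbrev" : "no-abbrev"));
        }
        return out;
    }

    public static void main(String[] args) {
        Set<String> abbreviations =
                new HashSet<>(Arrays.asList("Ms.", "Mr.", "Mrs.", "ext."));
        String text = "I called 464-6859 ext. 13 and asked for Mr. Frank.";
        for (String line : describeCandidates(text, abbreviations)) {
            System.out.println(line);
        }
    }
}
```

For the sample text this prints "ext. -> abbrev", "Mr. -> abbrev" and "Frank. -> no-abbrev". A flag like this is just one input to the model, so it can only help if the training data contains cases where the flagged position is not a sentence end.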




RE: How do abbreviations work when training a sentence detector

Posted by Ade Miller <ad...@getconga.com>.
I train the model on a sample stream with many sentences, one per line. The single sentence is just a trivial test example to see if abbreviations work.

model = trainer.train(language, sampleStream, fact, trainingParameters);

It seems like I have to define an abbreviation in the dictionary and also provide examples of it in the training data for this to work. In that case, I'm not clear what the abbreviations dictionary actually does.



Re: How do abbreviations work when training a sentence detector

Posted by Daniel Russ <dr...@apache.org>.
You are trying to train a sentence detector with only 1 sentence. Each line should be 1 sentence; the final character in the line marks the EOS (end of sentence). It should handle abbreviations correctly. The idea behind the sentence detector is that every period (or ? or !) is classified as EOS or notEOS.
Daniel  

Please see: http://opennlp.apache.org/docs/1.8.1/manual/opennlp.html#tools.sentdetect for more info.
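
To make the one-sentence-per-line convention concrete, here is a small plain-Java sketch of how labels could be derived from such training lines: the last character of each line is treated as EOS, and every other ".", "?" or "!" in the line as notEOS. It does not use OpenNLP, and the class name, method name and output format are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: derives EOS / notEOS labels from training lines that
// hold exactly one sentence each. It does not call OpenNLP.
class EosLabelSketch {

    // Characters treated as possible sentence ends (an assumption of this sketch).
    private static final String EOS_CHARS = ".?!";

    /** Labels every candidate character in a single training line. */
    static List<String> labelLine(String line) {
        List<String> labels = new ArrayList<>();
        String sentence = line.trim();
        for (int i = 0; i < sentence.length(); i++) {
            char c = sentence.charAt(i);
            if (EOS_CHARS.indexOf(c) < 0) {
                continue; // not a candidate
            }
            // Only the final character of the line marks a real sentence end.
            boolean isLast = (i == sentence.length() - 1);
            labels.add("index " + i + " '" + c + "': " + (isLast ? "EOS" : "notEOS"));
        }
        return labels;
    }

    public static void main(String[] args) {
        String[] trainingLines = {
            "The cat, Ms. Furry, sat on the mat.",
            "I called 464-6859 ext. 13 and asked for Mr. Frank."
        };
        for (String line : trainingLines) {
            System.out.println(line);
            for (String label : labelLine(line)) {
                System.out.println("  " + label);
            }
        }
    }
}
```

Seen this way, a training line containing "Ms." in the middle gives the model a notEOS example for that period, which is why abbreviations that appear inside training sentences tend to be handled better.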

