
Posted to users@opennlp.apache.org by Siarhei Rusak <ru...@gmail.com> on 2014/03/27 13:27:09 UTC

SentenceDetector & Abbreviations

Hello,

It seems I'm doing something wrong, but the documentation and the forum
aren't very helpful in my case.
My goal is to add abbreviations to SentenceDetector, but I haven't succeeded.
I'm trying to use this constructor overload:

public SentenceModel(String languageCode,
                     opennlp.model.AbstractModel sentModel,
                     boolean useTokenEnd,
                     Dictionary abbreviations)

(Dictionary:
<http://opennlp.apache.org/documentation/1.5.3/apidocs/opennlp-tools/opennlp/tools/dictionary/Dictionary.html>)

and the default model from the OpenNLP repository.

Here is a code example (it's a C# port via IKVM, so don't be confused):

var abbreviations = new Dictionary();
abbreviations.put(new StringList("corp."));

// Path to the "sent.model" file extracted from "en-sent.bin"
var modelPath = @"....\sent.model";
var dataStream = new DataInputStream(new FileInputStream(modelPath));
var sentenceModel = new BinaryGISModelReader(dataStream).getModel();
var abbreviatedSentenceModel =
    new SentenceModel("en", sentenceModel, true, abbreviations);
.............................

var sentenceSplitter = new SentenceDetectorME(abbreviatedSentenceModel);
sentenceSplitter.sentDetect(text);

The result of its execution is the same as if there were no abbreviation
dictionary at all.
So I suppose that either there is another way to do this, or it's a bug.
Could you help, please?

Thanks in advance,
Siarhei.

Re: SentenceDetector & Abbreviations

Posted by William Colen <wi...@gmail.com>.

Exactly. I just checked the English sentence detector model and it was not
trained with an abbreviation dictionary. In this case I believe including
one during runtime has no effect.


Re: SentenceDetector & Abbreviations

Posted by Siarhei Rusak <ru...@gmail.com>.

Hello, William.

My goal was to use the existing (default) sentence model, but "to add some
abbreviations" to it.
If I understood you correctly, there is no way to do that, because I would
need my own sample data, which I cannot extract from the existing model. Is
that correct?

Thanks,
Siarhei.

-- 
Best regards, S. Rusak

Re: SentenceDetector & Abbreviations

Posted by William Colen <wi...@gmail.com>.

Siarhei,

The abbreviation dictionary is used both at training time and at runtime.
During training, OpenNLP uses it while extracting features from the
training data: it checks whether a token is present in the dictionary and,
if there is a match, adds a feature to the model. At runtime, the
featurizer will, among other things, check whether a token can be an
abbreviation and add that to the list of features used to decide whether
it is a sentence separator or not.

In this case, you need to keep in mind that:
1) A match between a token and an entry in the abbreviation dictionary is
_not_ enough for OpenNLP to treat the token as an abbreviation; it takes
the whole context into account when deciding.
2) Training is important. If there wasn't an abbreviation dictionary during
training, or if the training data does not contain any abbreviation
matching the entries in the dictionary, OpenNLP will never add an
abbreviation dictionary feature to the model. This means that at runtime
it will not know what to do when an abbreviation dictionary feature is
found.
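
Point 2 can be illustrated with a toy sketch in plain Java. This is not
OpenNLP code: the class ToyBoundaryModel and the feature names "next=Cap"
and "abbrev" are made up for illustration. The only assumption is that a
trained model keeps one weight per feature seen during training and skips
any runtime feature it has no weight for.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a trained classifier (not an OpenNLP class):
// it holds one weight per feature name that was seen during training.
class ToyBoundaryModel {
    private final Map<String, Double> weights = new HashMap<>();

    // Registers a learned weight for a feature; returns this for chaining.
    ToyBoundaryModel weight(String feature, double w) {
        weights.put(feature, w);
        return this;
    }

    // Sums the weights of the active features. A feature without a learned
    // weight is skipped, so it cannot change the score.
    double score(List<String> activeFeatures) {
        double sum = 0.0;
        for (String feature : activeFeatures) {
            Double w = weights.get(feature);
            if (w != null) {
                sum += w;
            }
        }
        return sum;
    }

    // Logistic of the score: probability that the period ends a sentence.
    double splitProbability(List<String> activeFeatures) {
        return 1.0 / (1.0 + Math.exp(-score(activeFeatures)));
    }

    public static void main(String[] args) {
        List<String> plain = Arrays.asList("next=Cap");
        List<String> withAbbrev = Arrays.asList("next=Cap", "abbrev");

        // Trained without a dictionary: no weight exists for "abbrev".
        ToyBoundaryModel withoutDict =
            new ToyBoundaryModel().weight("next=Cap", 1.5);
        // Trained with a dictionary: "abbrev" received a (made-up) weight.
        ToyBoundaryModel withDict =
            new ToyBoundaryModel().weight("next=Cap", 1.5).weight("abbrev", -3.0);

        System.out.println(withoutDict.splitProbability(plain));
        System.out.println(withoutDict.splitProbability(withAbbrev));
        System.out.println(withDict.splitProbability(plain));
        System.out.println(withDict.splitProbability(withAbbrev));
    }
}
```

In this toy, the model without a weight for "abbrev" returns the same
probability whether or not the feature is active, while the model that
learned a negative weight for it becomes less likely to split.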

To understand it better, you can extract the model using a Zip utility and
take a look at the abbreviation dictionary inside it. You can check if
"corp." is there, and also try a few other abbreviations to check the
behavior.
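
If a Zip utility is not at hand, the same check can be sketched with the
JDK's java.util.zip classes. The class name ModelArchiveInspector is made
up for this example, and the entry name "abbreviations.dictionary" used in
the usage note is an assumption about how the model package names that
file; list the entries first and adjust it to what you actually see.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipFile;

// Hypothetical helper (not part of OpenNLP) for looking inside a model
// package, which can be opened as an ordinary zip archive.
class ModelArchiveInspector {

    // Returns the names of all entries in the archive.
    static List<String> entryNames(String archivePath) throws IOException {
        List<String> names = new ArrayList<>();
        try (ZipFile zip = new ZipFile(archivePath)) {
            Enumeration<? extends ZipEntry> entries = zip.entries();
            while (entries.hasMoreElements()) {
                names.add(entries.nextElement().getName());
            }
        }
        return names;
    }

    // Reports whether the archive contains an entry with the given name.
    static boolean hasEntry(String archivePath, String entryName)
            throws IOException {
        try (ZipFile zip = new ZipFile(archivePath)) {
            return zip.getEntry(entryName) != null;
        }
    }

    // Usage: java ModelArchiveInspector <path to model archive>
    public static void main(String[] args) throws IOException {
        for (String name : entryNames(args[0])) {
            System.out.println(name);
        }
    }
}
```

Running it against "en-sent.bin" should list "sent.model" among the
entries. A call such as hasEntry(path, "abbreviations.dictionary") then
tells you whether a dictionary was packaged with the model, assuming that
is the entry name in your version.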

Regards,
William


