Posted to dev@opennlp.apache.org by Joern Kottmann <ko...@gmail.com> on 2016/02/17 10:00:37 UTC

Language Model contribution

Hello,

I saw the language model commit. Thanks for contributing that!

Would it be possible to get a short introduction to it?

The interface is supposed to take a StringList. Wouldn't it be better if a
user could just pass in a String instead? Otherwise they have to worry about
tokenizing a string in a language they don't know. I think that should be
the task of the language detector.

Can we come up with another name for the package? Maybe langid/langdetect
or something similar? Any opinions?

We usually use "Model" to refer to machine learning models, so maybe we
could rename the LanguageModel interface to LanguageDetector.

Jörn

Re: Language Model contribution

Posted by Tommaso Teofili <to...@gmail.com>.

Hi Jörn,

Good that you're OK with the LanguageModel API. Currently the only
implementation is NGramLanguageModel.
To build such a model you add n-grams to it, as with NGramModel:

> // declared as NGramLanguageModel because add(...) comes from NGramModel
> NGramLanguageModel languageModel = new NGramLanguageModel(3); // trigram language model
> languageModel.add(new StringList(tokens), 1, 3); // uni/bi/tri-grams for tokenized text (StringList)

Once you are done adding n-grams, you can compute the probability of, for
example, a tokenized sentence with:

> double p = languageModel.calculateProbability(
>     new StringList("neural", "network", "language"));

Internally it uses Laplace smoothing [1] to compute probabilities if the
model holds fewer than 1M n-grams (|ngrams| < 1M); otherwise it uses Stupid
Backoff [2].
You can also use the LM to predict the next n-gram given a sequence of
tokens, but that iterates over all the n-grams to find the most probable
one and could be slow.

> StringList tokens = languageModel.predictNextTokens(
>     new StringList("neural", "network", "language"));
> assertEquals(new StringList("models"), tokens);
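
To make the two smoothing schemes and the linear-scan prediction above a bit
more concrete, here is a minimal stand-alone sketch in plain Java. It is not
the NGramLanguageModel code: the class SmoothingSketch, its method names, the
scoring of a single n-gram rather than a whole sentence, and the restriction
to predicting one next token are all assumptions made for illustration. The
0.4 backoff factor is the value suggested in [2].

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch; names and structure are not taken from OpenNLP. */
class SmoothingSketch {
    private final Map<List<String>, Integer> counts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int totalTokens = 0;

    /** Counts every n-gram of length 1..maxN found in the token sequence. */
    void add(List<String> tokens, int maxN) {
        vocabulary.addAll(tokens);
        totalTokens += tokens.size();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= tokens.size(); i++) {
                counts.merge(new ArrayList<>(tokens.subList(i, i + n)), 1, Integer::sum);
            }
        }
    }

    private int count(List<String> ngram) {
        return counts.getOrDefault(ngram, 0);
    }

    /** Add-one (Laplace) estimate of P(last token | preceding tokens). */
    double laplace(List<String> ngram) {
        List<String> context = ngram.subList(0, ngram.size() - 1);
        int contextCount = context.isEmpty() ? totalTokens : count(context);
        return (count(ngram) + 1.0) / (contextCount + vocabulary.size());
    }

    /** Stupid Backoff score: relative frequency, else 0.4 * score of the shorter n-gram. */
    double stupidBackoff(List<String> ngram) {
        if (ngram.size() == 1) {
            return totalTokens == 0 ? 0.0 : (double) count(ngram) / totalTokens;
        }
        int c = count(ngram);
        if (c > 0) {
            // The context was counted as a lower-order n-gram, so its count is non-zero here.
            return (double) c / count(ngram.subList(0, ngram.size() - 1));
        }
        return 0.4 * stupidBackoff(ngram.subList(1, ngram.size()));
    }

    /** Linear scan over the vocabulary for the best-scoring next token (null if empty). */
    String predictNext(List<String> context, int order) {
        int from = Math.max(0, context.size() - (order - 1));
        List<String> tail = context.subList(from, context.size());
        String best = null;
        double bestScore = -1.0;
        for (String candidate : vocabulary) {
            List<String> ngram = new ArrayList<>(tail);
            ngram.add(candidate);
            double score = stupidBackoff(ngram);
            if (score > bestScore) {
                bestScore = score;
                best = candidate;
            }
        }
        return best;
    }
}
```

With the tokens neural, network, language, models added up to trigrams,
predictNext on (neural, network) with order 3 returns language, since that
is the only continuation seen in the counts. The scan touches every
vocabulary entry, which is why this kind of prediction gets slow on large
models.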

For a quick look at its usage, see
NgramLanguageModelTest#testTrigramLanguageModelCreationFromText [3].

Hope this helps, and of course if there are any additional questions, feel
free to ask.
Regards,
Tommaso

[1] : https://en.wikipedia.org/wiki/Additive_smoothing
[2] : http://www.aclweb.org/anthology/D07-1090.pdf
[3] : https://github.com/apache/opennlp/blob/trunk/opennlp-tools/src/test/java/opennlp/tools/languagemodel/NgramLanguageModelTest.java#L131

On Wed, 17 Feb 2016 at 19:39, Joern Kottmann <ko...@gmail.com> wrote:

> Oops, I confused the language model you were working on with language
> detection.
> I think the interface is good as it is.
>
> Jörn

Re: Language Model contribution

Posted by Joern Kottmann <ko...@gmail.com>.

Oops, I confused the language model you were working on with language
detection.
I think the interface is good as it is.

Jörn
