
Posted to issues@opennlp.apache.org by "Jiri Zamecnik (Jira)" <ji...@apache.org> on 2019/09/06 08:50:00 UTC

[jira] [Commented] (OPENNLP-1183) Better language model support

    [ https://issues.apache.org/jira/browse/OPENNLP-1183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16924061#comment-16924061 ] 

Jiri Zamecnik commented on OPENNLP-1183:
----------------------------------------

I started some work on this, inspired by the suggestions of [Pibiri & Venturini 2018|https://arxiv.org/pdf/1806.09447.pdf], storing the n-grams in a trie structure. Currently, I have two implementations:
 # Trie of HashMaps: good speed, but at the cost of higher space demands.
 # Trie stored as arrays of integer pointers (as described in the paper, Section 3.2.1): decent speed, much more space-efficient.
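
To illustrate the first variant, here is a minimal sketch of a trie of HashMaps that stores n-gram counts. It is not the code from this work: the class name HashMapNGramTrie, its methods, and the choice of String tokens are assumptions made only for illustration.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical illustration class; not part of OpenNLP. */
class HashMapNGramTrie {

  /** One trie node: the count of the n-gram ending here, plus children keyed by the next token. */
  private static final class Node {
    int count;
    final Map<String, Node> children = new HashMap<>();
  }

  private final Node root = new Node();

  /** Adds one occurrence of the given n-gram, creating nodes along the path as needed. */
  void add(String... tokens) {
    Node node = root;
    for (String token : tokens) {
      node = node.children.computeIfAbsent(token, t -> new Node());
    }
    node.count++;
  }

  /** Returns how often exactly this n-gram was added, or 0 if it was never seen. */
  int count(String... tokens) {
    Node node = root;
    for (String token : tokens) {
      node = node.children.get(token);
      if (node == null) {
        return 0;
      }
    }
    return node.count;
  }
}
```

Each node carries its own HashMap, which keeps lookups fast but is the likely source of the space overhead noted above. This sketch counts only the full n-grams passed to add; it does not accumulate counts for their prefixes.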

Both of them are faster than the default implementation (estimated on 3-grams in the MASC corpus), both when adding n-grams and when extracting them. So far, the trie of integer pointers is not compressed (unlike in the paper), since I couldn't find a compatibly licensed Elias-Fano implementation for Java (except for an old version of Lucene). I would greatly appreciate suggestions there (compression methods with random access).
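
For readers unfamiliar with the uncompressed array layout, the general idea of integer pointers into flat sorted arrays can be sketched for the bigram case as follows. This is only a rough sketch under my own assumptions: the class ArrayBigramTrie, its field names, and the use of pre-assigned integer token IDs are hypothetical and are taken neither from the paper nor from the implementation discussed here.

```java
import java.util.Arrays;

/** Hypothetical two-level (bigram) trie over sorted int arrays; illustration only. */
class ArrayBigramTrie {
  // Level 1: sorted IDs of first tokens.
  private final int[] firstIds;
  // pointers[i] (inclusive) to pointers[i + 1] (exclusive) is the range in secondIds
  // that holds the successors of firstIds[i]; pointers has firstIds.length + 1 entries.
  private final int[] pointers;
  // Level 2: successor token IDs, sorted within each range.
  private final int[] secondIds;
  // counts[j] is the count of the bigram that ends at secondIds[j].
  private final int[] counts;

  ArrayBigramTrie(int[] firstIds, int[] pointers, int[] secondIds, int[] counts) {
    this.firstIds = firstIds;
    this.pointers = pointers;
    this.secondIds = secondIds;
    this.counts = counts;
  }

  /** Looks up a bigram with two binary searches; returns 0 if it is not stored. */
  int count(int first, int second) {
    int i = Arrays.binarySearch(firstIds, first);
    if (i < 0) {
      return 0;
    }
    int j = Arrays.binarySearch(secondIds, pointers[i], pointers[i + 1], second);
    return j < 0 ? 0 : counts[j];
  }
}
```

Because every level is a plain int array, there are no per-node objects, which is where the space savings over the HashMap variant would come from. The sorted ID arrays are also the natural target for a compression scheme with random access.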

I am experimenting with the LRUCache for caching the probabilities.
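
As a rough illustration of the caching idea, a least-recently-used cache for probabilities can be built on the JDK's LinkedHashMap. This is a stand-in and not necessarily the LRUCache mentioned above; the class name ProbabilityLruCache and the choice of a String key for the n-gram are assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical LRU cache from an n-gram key to its probability; illustration only. */
class ProbabilityLruCache extends LinkedHashMap<String, Double> {
  private final int capacity;

  ProbabilityLruCache(int capacity) {
    // accessOrder = true makes iteration order follow recency of access.
    super(16, 0.75f, true);
    this.capacity = capacity;
  }

  /** Evicts the least recently used entry once the capacity is exceeded. */
  @Override
  protected boolean removeEldestEntry(Map.Entry<String, Double> eldest) {
    return size() > capacity;
  }
}
```

With a capacity of 2, inserting a third entry drops whichever of the first two was accessed least recently.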

I am now working on implementing Chen and Goodman's modified Kneser-Ney smoothing and a flexible way of adding new estimators.

> Better language model support
> -----------------------------
>
>                 Key: OPENNLP-1183
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1183
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: language model
>            Reporter: Tommaso Teofili
>            Priority: Major
>
> As per [ONIP-1|https://cwiki.apache.org/confluence/display/OPENNLP/ONIP-1+Better+language+model+support] it would be nice to provide better language modelling support. This means more compact models, faster prediction, more accurate estimations.



--
This message was sent by Atlassian Jira
(v8.3.2#803003)