Posted to issues@opennlp.apache.org by "Rodrigo Agerri (JIRA)" <ji...@apache.org> on 2016/02/18 22:13:18 UTC

[jira] [Commented] (OPENNLP-760) probabilistic lemmatizer

    [ https://issues.apache.org/jira/browse/OPENNLP-760?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15153094#comment-15153094 ] 

Rodrigo Agerri commented on OPENNLP-760:
----------------------------------------

The statistical lemmatizer has now been added. For training, the lemmatizer takes a word, its POS tag and its lemma from a corpus and induces the lemma class by calculating the permutations (edit operations) required to transform the word form into the lemma. This is performed on the reversed strings. The resulting set of permutations is the class that the statistical lemmatizer learns. Once predicted, the lemma class is decoded back into the lemma.
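
As a rough illustration of the idea (not the code committed for this issue), the induction and decoding of a lemma class could be sketched as below. The class name LemmaClassSketch, the encode/decode helpers and the textual class format ("D" = delete, "I" = insert, "R" = replace, each followed by a position in the reversed word) are assumptions made for this example only. The sketch uses a plain Levenshtein edit script, which may differ from what the actual lemmatizer computes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and class format are NOT the OpenNLP API.
class LemmaClassSketch {

  static String reverse(String s) {
    return new StringBuilder(s).reverse().toString();
  }

  // Induce a lemma class: the edit operations that turn the reversed word
  // into the reversed lemma. Assumes neither string contains whitespace.
  static String encode(String word, String lemma) {
    String w = reverse(word);
    String l = reverse(lemma);
    int n = w.length();
    int m = l.length();
    // Standard Levenshtein distance table.
    int[][] d = new int[n + 1][m + 1];
    for (int i = 0; i <= n; i++) d[i][0] = i;
    for (int j = 0; j <= m; j++) d[0][j] = j;
    for (int i = 1; i <= n; i++) {
      for (int j = 1; j <= m; j++) {
        int cost = w.charAt(i - 1) == l.charAt(j - 1) ? 0 : 1;
        d[i][j] = Math.min(d[i - 1][j - 1] + cost,
            Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
      }
    }
    // Backtrace from the end, so operations are emitted in descending
    // position order and can be applied one after another when decoding.
    List<String> ops = new ArrayList<>();
    int i = n;
    int j = m;
    while (i > 0 || j > 0) {
      if (i > 0 && j > 0 && w.charAt(i - 1) == l.charAt(j - 1)
          && d[i][j] == d[i - 1][j - 1]) {
        i--; j--;                                  // characters match, no operation
      } else if (i > 0 && j > 0 && d[i][j] == d[i - 1][j - 1] + 1) {
        ops.add("R" + (i - 1) + l.charAt(j - 1));  // replace character at i-1
        i--; j--;
      } else if (i > 0 && d[i][j] == d[i - 1][j] + 1) {
        ops.add("D" + (i - 1));                    // delete character at i-1
        i--;
      } else {
        ops.add("I" + i + l.charAt(j - 1));        // insert character at i
        j--;
      }
    }
    return String.join(" ", ops);
  }

  // Decode a (predicted) lemma class back into a lemma for the given word.
  // No bounds checking: a class that does not fit the word will throw.
  static String decode(String word, String lemmaClass) {
    if (lemmaClass.isEmpty()) return word;         // word is already the lemma
    StringBuilder sb = new StringBuilder(word).reverse();
    for (String op : lemmaClass.split(" ")) {
      char type = op.charAt(0);
      if (type == 'D') {
        sb.deleteCharAt(Integer.parseInt(op.substring(1)));
      } else {
        int pos = Integer.parseInt(op.substring(1, op.length() - 1));
        char c = op.charAt(op.length() - 1);
        if (type == 'I') sb.insert(pos, c);
        else sb.setCharAt(pos, c);
      }
    }
    return sb.reverse().toString();
  }

  public static void main(String[] args) {
    String cls = encode("walking", "walk");
    System.out.println(cls);                       // D2 D1 D0
    System.out.println(decode("talking", cls));    // talk
  }
}
```

Because the operations are indexed on the reversed strings, positions count from the end of the word, so "walking" -> "walk" and "talking" -> "talk" yield the same class regardless of word length. This is what lets a single learned class generalize across words that share a suffix change.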

For better API management, the DictionaryLemmatizer API has been modified to reflect the interface of other tools in OpenNLP.

Once this issue is closed, the following remains to be done:
- Add a cmdline component for the new learnable lemmatizer.
- Add unit tests.
- Update the lemmatizer section in the documentation.

> probabilistic lemmatizer
> ------------------------
>
>                 Key: OPENNLP-760
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-760
>             Project: OpenNLP
>          Issue Type: New Feature
>          Components: Lemmatizer
>            Reporter: Rodrigo Agerri
>            Assignee: Rodrigo Agerri
>            Priority: Minor
>
> The current SimpleLemmatizer is dictionary-based. A probabilistic lemmatizer works better for unknown words and can be combined with dictionaries.
> The method we will implement here is based on: 
> Grzegorz Chrupała. 2008. Towards a Machine-Learning Architecture for Lexical Functional Grammar Parsing. PhD dissertation, Dublin City University. http://grzegorz.chrupala.me/papers/phd-single.pdf


