Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2017/01/25 14:50:26 UTC

[jira] [Commented] (OPENNLP-957) Create a normalizer feature generator

    [ https://issues.apache.org/jira/browse/OPENNLP-957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15837849#comment-15837849 ] 

Joern Kottmann commented on OPENNLP-957:
----------------------------------------

Here we also need to look at the tokenizer. In most cases it is probably better not to tokenize these kinds of strings at all. The tokenizer could be extended with optional functions that a user can activate to recognize patterns that form a single token and should not be cut into further pieces.
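
A minimal sketch of that idea, not the OpenNLP tokenizer API: spans that match a "protected" pattern are emitted as whole tokens, and only the text between them goes through a simple fallback rule. The class name, both regular expressions, and the fallback rule are assumptions made for illustration. The URL pattern is deliberately crude and would also swallow trailing punctuation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class ProtectedPatternTokenizerSketch {

    // Spans that must stay in one piece: URLs or email addresses (assumed patterns).
    private static final Pattern PROTECTED = Pattern.compile(
            "(?:https?|ftp)://\\S+|[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");

    // Fallback rule: a run of letters/digits, or a single other non-space character.
    private static final Pattern SIMPLE =
            Pattern.compile("[\\p{L}\\p{N}]+|[^\\p{L}\\p{N}\\s]");

    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = PROTECTED.matcher(text);
        int last = 0;
        while (m.find()) {
            // Tokenize the unprotected text before this match, then keep the match whole.
            addSimpleTokens(text.substring(last, m.start()), tokens);
            tokens.add(m.group());
            last = m.end();
        }
        addSimpleTokens(text.substring(last), tokens);
        return tokens;
    }

    private static void addSimpleTokens(String chunk, List<String> out) {
        Matcher m = SIMPLE.matcher(chunk);
        while (m.find()) {
            out.add(m.group());
        }
    }
}
```

Without the protected pass, the fallback rule alone would cut a URL into pieces such as "http", ":", "/", "/", "apache", and so on, which is the behaviour the comment argues against.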

> Create a normalizer feature generator
> -------------------------------------
>
>                 Key: OPENNLP-957
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-957
>             Project: OpenNLP
>          Issue Type: Improvement
>            Reporter: William Colen
>             Fix For: 1.7.2
>
>
> Create an aggregate feature generator that can modify the tokens. For example:
> - Numbers: 9838749 -> 9999999
> - Interjection: hellllloooooo -> hello
> - URL: http://apache.opennlp.org -> $URL$
> - Email: users@opennlp.apache.org -> $EMAIL$
> ...
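
The normalizations listed in the issue can be sketched as a single token-level function. This is only an illustration under stated assumptions: the class name, the regular expressions, and the order of the checks are hypothetical, and it is not wired into the OpenNLP feature generator interfaces. Collapsing repeated letters is a rough heuristic. It reduces runs of three or more identical letters to two, so "hellllloooooo" becomes "helloo" rather than "hello"; mapping to the dictionary form would need a lexicon.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of OpenNLP.
final class TokenNormalizerSketch {

    // All patterns below are assumptions, kept deliberately simple.
    private static final Pattern URL = Pattern.compile("(?:https?|ftp)://\\S+");
    private static final Pattern EMAIL =
            Pattern.compile("[\\w.+-]+@[\\w-]+(?:\\.[\\w-]+)+");
    private static final Pattern NUMBER = Pattern.compile("\\d+(?:[.,]\\d+)*");
    private static final Pattern DIGIT = Pattern.compile("\\d");
    // A letter followed by at least two more copies of itself.
    private static final Pattern REPEATED = Pattern.compile("(\\p{L})\\1{2,}");

    static String normalize(String token) {
        if (URL.matcher(token).matches()) {
            return "$URL$";
        }
        if (EMAIL.matcher(token).matches()) {
            return "$EMAIL$";
        }
        if (NUMBER.matcher(token).matches()) {
            // Keep the shape of the number but hide its value.
            return DIGIT.matcher(token).replaceAll("9");
        }
        // Collapse runs of 3+ identical letters down to two letters.
        return REPEATED.matcher(token).replaceAll("$1$1");
    }
}
```

A feature generator built on this would emit features for the normalized token instead of the raw one, so that all URLs, for instance, share one feature value.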



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)