Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2017/01/16 14:34:26 UTC

[jira] [Updated] (OPENNLP-515) Request for multi-words expressions (MWE) support in serialization formats

     [ https://issues.apache.org/jira/browse/OPENNLP-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joern Kottmann updated OPENNLP-515:
-----------------------------------
    Component/s:     (was: Coref)

> Request for multi-words expressions (MWE) support in serialization formats
> --------------------------------------------------------------------------
>
>                 Key: OPENNLP-515
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-515
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Chunker, Command Line Interface, Doccat, Name Finder, Parser, POS Tagger
>    Affects Versions: tools-1.5.3
>            Reporter: Nicolas Hernandez
>            Priority: Minor
>
> Multi-word expressions (MWEs) are expressions made up of whitespace-separated words, such as "traffic light", "in order to", "two thousand and one", or "Jules Verne".
> So far, when training a model with the CLI (in particular a POS model), there has been no way to specify whether an item is a single word or a multi-word expression.
> By convention, users join the words of an MWE with the underscore character so that the MWE becomes a single token.
> Consequently, a model trained through the API on the same data can differ, since the API does not require this preprocessing.
> We need to let users set the MWE separator character sequence through a CLI parameter.
> This concerns both trainers and labelers.
> A default MWE separator should also be specified, to be used when serializing data that contains MWEs.
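
To illustrate the request above, here is a minimal sketch of a configurable MWE separator: the words of an expression are joined into one token for serialization and split back on reading. The class name `MweSeparatorSketch`, its methods, and the `DEFAULT_MWE_SEPARATOR` constant are hypothetical and are not part of the OpenNLP API; the underscore default only mirrors the user convention described in the issue.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper, not an OpenNLP class: shows how a configurable
// MWE separator could be applied when serializing and reading tokens.
public class MweSeparatorSketch {

    // Assumed default, following the underscore convention from the issue.
    public static final String DEFAULT_MWE_SEPARATOR = "_";

    // Joins the words of one expression into a single serialized token.
    public static String joinMwe(List<String> words, String separator) {
        return String.join(separator, words);
    }

    // Splits a serialized token back into its words.
    // Pattern.quote treats the separator literally, not as a regex.
    public static List<String> splitMwe(String token, String separator) {
        return Arrays.asList(token.split(Pattern.quote(separator), -1));
    }

    public static void main(String[] args) {
        List<String> mwe = Arrays.asList("in", "order", "to");

        // Serialize with the default separator.
        String token = joinMwe(mwe, DEFAULT_MWE_SEPARATOR);
        System.out.println(token); // prints "in_order_to"

        // A user-chosen separator, as a CLI parameter might supply it.
        System.out.println(joinMwe(mwe, "+"));

        // Reading the token back recovers the original words.
        System.out.println(splitMwe(token, DEFAULT_MWE_SEPARATOR));
    }
}
```

One limitation of this sketch is that a word which itself contains the separator cannot be told apart from an MWE after joining, which is one reason a user-settable separator is useful.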


