Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2017/01/16 14:34:26 UTC
[jira] [Updated] (OPENNLP-515) Request for multi-words expressions (MWE) support in serialization formats
[ https://issues.apache.org/jira/browse/OPENNLP-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joern Kottmann updated OPENNLP-515:
-----------------------------------
Component/s: (was: Coref)
> Request for multi-words expressions (MWE) support in serialization formats
> --------------------------------------------------------------------------
>
> Key: OPENNLP-515
> URL: https://issues.apache.org/jira/browse/OPENNLP-515
> Project: OpenNLP
> Issue Type: Improvement
> Components: Chunker, Command Line Interface, Doccat, Name Finder, Parser, POS Tagger
> Affects Versions: tools-1.5.3
> Reporter: Nicolas Hernandez
> Priority: Minor
>
> Multi-word expressions (MWEs) are expressions made of several whitespace-separated words, such as "traffic light", "in order to", "two thousand and one", or "Jules Verne".
> So far, when training a model through the CLI (in particular a POS model), there has been no way to specify whether an expression is a single word or a multi-word expression.
> By convention, users join the words of an MWE with the underscore character so that the MWE becomes a single token.
> Consequently, a model trained through the API on the same data can differ, since the API does not require this preprocessing.
> We need to let users set the MWE separator character sequence through a CLI parameter.
> This concerns both trainers and labelers.
> A default MWE separator should also be specified, to be used when serializing data that contains MWEs.
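
To illustrate the convention described in the issue, here is a minimal sketch of joining the words of an MWE into a single token with a configurable separator, and splitting the token back. It is not OpenNLP code: the names `DEFAULT_MWE_SEPARATOR`, `join_mwe`, `split_mwe` and `serialize` are hypothetical, and the underscore default is only an assumption based on the user convention mentioned above.

```python
# Assumption: the underscore convention from the issue is used as the default.
DEFAULT_MWE_SEPARATOR = "_"


def join_mwe(words, separator=DEFAULT_MWE_SEPARATOR):
    """Concatenate the words of one expression into a single token."""
    return separator.join(words)


def split_mwe(token, separator=DEFAULT_MWE_SEPARATOR):
    """Recover the individual words from a joined token."""
    return token.split(separator)


def serialize(expressions, separator=DEFAULT_MWE_SEPARATOR):
    """Write a sentence as whitespace-separated tokens, one per expression."""
    return " ".join(join_mwe(words, separator) for words in expressions)


# Each inner list is one expression; single words are one-element lists.
sentence = [["Stop"], ["at"], ["the"], ["traffic", "light"],
            ["in", "order", "to"], ["turn"]]

print(serialize(sentence))
# Stop at the traffic_light in_order_to turn
```

A separator passed as a parameter, as the issue requests, would let the same data be serialized with another character sequence, for example `serialize(sentence, separator="+")`. Note that a word which already contains the separator cannot be told apart from an MWE when splitting, which is one reason to make the separator configurable.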
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)