Posted to issues@opennlp.apache.org by "Nicolas Hernandez (JIRA)" <ji...@apache.org> on 2012/06/27 16:47:44 UTC

[jira] [Created] (OPENNLP-515) Request for multi-words expressions (MWE) support in serialization formats

Nicolas Hernandez created OPENNLP-515:
-----------------------------------------

             Summary: Request for multi-words expressions (MWE) support in serialization formats
                 Key: OPENNLP-515
                 URL: https://issues.apache.org/jira/browse/OPENNLP-515
             Project: OpenNLP
          Issue Type: New Feature
          Components: Chunker, Command Line Interface, Coref, Doccat, Name Finder, Parser, POS Tagger
    Affects Versions: tools-1.5.3
            Reporter: Nicolas Hernandez


Multi-word expressions (MWEs) are expressions made up of whitespace-separated words, such as "traffic light", "in order to", "two thousand and one", or "Jules Verne".

So far, when training a model through the CLI (in particular a POS model), there has been no way to specify whether an expression is a single word or a multi-word expression.
By convention, users join the words of an MWE with the underscore character so that the MWE becomes a single token.
Consequently, a model trained through the API on the same data can differ, since the API does not require this preprocessing.

We need to let users set, via a CLI parameter, the character sequence used as the MWE separator.

This concerns both trainers and labelers.

A default MWE separator should be specified; it will be used when serializing data that contains MWEs.
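
To make the request concrete, here is a minimal sketch of what a configurable MWE separator could look like when serializing and reading back whitespace-tokenized data. The class MweSeparatorSketch, its methods, and the DEFAULT_SEPARATOR constant are hypothetical names used only for illustration; none of them are part of the OpenNLP API, and the underscore default is an assumption based on the convention described above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of the OpenNLP API.
final class MweSeparatorSketch {

    // Assumed default, following the underscore convention described in the issue.
    static final String DEFAULT_SEPARATOR = "_";

    // Serializes a sentence to one line. Each token is a list of words:
    // a simple word has one element, an MWE has several.
    static String serialize(List<List<String>> tokens, String separator) {
        List<String> joined = new ArrayList<>();
        for (List<String> words : tokens) {
            // The words of an MWE are joined so the MWE survives whitespace tokenization.
            joined.add(String.join(separator, words));
        }
        return String.join(" ", joined);
    }

    // Reads a serialized line back into tokens, splitting each token on the separator.
    static List<List<String>> deserialize(String line, String separator) {
        List<List<String>> tokens = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            // Pattern.quote lets separators such as "+" be used literally.
            tokens.add(Arrays.asList(token.split(Pattern.quote(separator))));
        }
        return tokens;
    }

    public static void main(String[] args) {
        List<List<String>> sentence = Arrays.asList(
                Arrays.asList("traffic", "light"),
                Arrays.asList("ahead"));
        String line = serialize(sentence, DEFAULT_SEPARATOR);
        System.out.println(line);
        System.out.println(deserialize(line, DEFAULT_SEPARATOR));
    }
}
```

With a separator passed in explicitly, the same data can be written with "_" for one tool and with another character sequence for a tool whose format already gives the underscore a meaning.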


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (OPENNLP-515) Request for multi-words expressions (MWE) support in serialization formats

Posted by "Nicolas Hernandez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Hernandez updated OPENNLP-515:
--------------------------------------

    Priority: Minor  (was: Major)
    
> Request for multi-words expressions (MWE) support in serialization formats
> --------------------------------------------------------------------------
>
>                 Key: OPENNLP-515
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-515
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Chunker, Command Line Interface, Coref, Doccat, Name Finder, Parser, POS Tagger
>    Affects Versions: tools-1.5.3
>            Reporter: Nicolas Hernandez
>            Priority: Minor


        

[jira] [Updated] (OPENNLP-515) Request for multi-words expressions (MWE) support in serialization formats

Posted by "Nicolas Hernandez (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/OPENNLP-515?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Nicolas Hernandez updated OPENNLP-515:
--------------------------------------

    Issue Type: Improvement  (was: New Feature)
    
> Request for multi-words expressions (MWE) support in serialization formats
> --------------------------------------------------------------------------
>
>                 Key: OPENNLP-515
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-515
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Chunker, Command Line Interface, Coref, Doccat, Name Finder, Parser, POS Tagger
>    Affects Versions: tools-1.5.3
>            Reporter: Nicolas Hernandez
