Posted to issues@opennlp.apache.org by "John Andrunas (Jira)" <ji...@apache.org> on 2019/12/04 00:32:00 UTC

[jira] [Issue Comment Deleted] (OPENNLP-1201) add bailout way for certain languages in order to use POS features

     [ https://issues.apache.org/jira/browse/OPENNLP-1201?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

John Andrunas updated OPENNLP-1201:
-----------------------------------
    Comment: was deleted

(was: Simon Poortman
The Netherlands is crazy
If you're arrested, they're throwing your psychiatry into it. War is perhaps the only solution who helps me that this doesn't have to happen? War is just what democracy in the Union for the Benelux 1 bundesland is to make and new law and care sisteem help me!!)

> add bailout way for certain languages in order to use POS features
> ------------------------------------------------------------------
>
>                 Key: OPENNLP-1201
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-1201
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: Command Line Interface, Formats
>    Affects Versions: 1.8.4
>            Reporter: Koji Sekiguchi
>            Assignee: Koji Sekiguchi
>            Priority: Major
>
> OpenNLP tools assume that the text being processed has been tokenized in advance (in other words, that the words in the text are separated from each other by spaces), so it is difficult for users of certain languages (e.g. CJK) to use POS (Part-of-Speech) features.
> To simplify the explanation, consider using NameFinder for Japanese text. The NameFinder tools (Train, Eval, Recognize) require users to provide Japanese text that has already been tokenized, but once we tokenize Japanese text, it loses its POS information. (I think Chinese has the same problem.)
> Let me describe this problem for users of western languages :) (English, French, Italian, etc.) without using Japanese letters. I’ll try to use the English alphabet instead.
> Suppose you have the sentence “isentthemachine” and want to give it to NameFinder, so you use a morphological analyzer to tokenize it. There are two possible sequences of tokens:
> - i (PPSS) / sent (VBD) / the (AT) / machine (NP)
> - i (PPSS) / sent (VBD) / them (PPO) / a (AT) / chine (NP)
> As you can see, a morphological analyzer not only tokenizes the sentence but also assigns a POS tag to each token. The same thing happens in Japanese (and in Chinese, I think).
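The ambiguity described above can be reproduced with a small dictionary-based search. The following is only a minimal sketch: the class name `SegmentationSketch` and its toy lexicon are hypothetical and are not part of OpenNLP, and a real morphological analyzer would also score the candidates and attach POS tags.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class SegmentationSketch {
    // Hypothetical toy lexicon (an assumption for illustration only).
    static final Set<String> LEXICON =
            Set.of("i", "sent", "the", "them", "a", "machine", "chine");

    /** Returns every way to split text into words that all appear in LEXICON. */
    static List<List<String>> segment(String text) {
        List<List<String>> results = new ArrayList<>();
        collect(text, 0, new ArrayList<>(), results);
        return results;
    }

    // Depth-first search over candidate words starting at position 'start'.
    private static void collect(String text, int start,
                                List<String> current, List<List<String>> results) {
        if (start == text.length()) {
            results.add(new ArrayList<>(current)); // consumed the whole input
            return;
        }
        // Shorter candidate words are tried first.
        for (int end = start + 1; end <= text.length(); end++) {
            String candidate = text.substring(start, end);
            if (LEXICON.contains(candidate)) {
                current.add(candidate);
                collect(text, end, current, results);
                current.remove(current.size() - 1); // backtrack
            }
        }
    }

    public static void main(String[] args) {
        for (List<String> tokens : segment("isentthemachine")) {
            System.out.println(String.join(" / ", tokens));
        }
    }
}
```

With this lexicon the search finds exactly the two token sequences listed above, which is why tokenization alone cannot decide between them without additional information such as POS tags.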
> However, the OpenNLP feature generator API accepts only a sequence of tokens, i.e. `String[] tokens`, so I cannot produce POS features in the feature generator.
> To solve this problem (and to invite many users to our community), I’d like to suggest that OpenNLP tools allow users to add optional information to each tokenized word.
> For example, one can give the following text when using NameFinder tools.
> {code}
> $ cat en-ner.train
> I/PPSS sent/VBD the/AT machine/NP
> {code}
> When using such text, users must tell the tool that each token in the text carries a POS tag by passing an option, e.g. -postag:
> {code}
> $ opennlp TokenNameFinderTrainer -data en-ner.train -model en-ner.bin -postag
> {code}
> We can maintain backward compatibility by making -postag false by default; in that case, existing feature generators work exactly as before. If a user sets the -postag option on the command line, the existing feature generators strip the “/POS” part of each “word/POS” token so that they produce the same features as before.
> In addition to handling the -postag option, I’d like to add a simple feature generator that emits only the “POS” part of each “word/POS” token. This simple feature generator allows Japanese/Chinese users to produce precise POS features.
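The proposed handling of “word/POS” tokens could look roughly like the sketch below. It is only an illustration under stated assumptions: the class `PosTagSketch`, its method names, and the `pos=` feature prefix are all hypothetical, and the code does not implement any OpenNLP interface. It only takes a `String[] tokens` array in the same spirit as the feature generator API mentioned in the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class PosTagSketch {
    /** Returns the word part of "word/POS"; a token without a tag is returned unchanged. */
    static String word(String token) {
        int slash = token.lastIndexOf('/');
        return slash > 0 ? token.substring(0, slash) : token;
    }

    /** Returns the POS part of "word/POS", or null when the token carries no tag. */
    static String pos(String token) {
        int slash = token.lastIndexOf('/');
        return slash > 0 && slash < token.length() - 1 ? token.substring(slash + 1) : null;
    }

    /** What existing feature generators would see when -postag is set: plain words. */
    static String[] stripPos(String[] tokens) {
        String[] words = new String[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            words[i] = word(tokens[i]);
        }
        return words;
    }

    /** Hypothetical POS-only feature generator: adds one feature for the tag at 'index'. */
    static void createPosFeatures(List<String> features, String[] tokens, int index) {
        String tag = pos(tokens[index]);
        if (tag != null) {
            features.add("pos=" + tag); // the "pos=" prefix is an assumption
        }
    }

    public static void main(String[] args) {
        String[] tokens = "I/PPSS sent/VBD the/AT machine/NP".split(" ");
        System.out.println(Arrays.toString(stripPos(tokens)));

        List<String> features = new ArrayList<>();
        createPosFeatures(features, tokens, 3);
        System.out.println(features);
    }
}
```

Splitting on the last `/` rather than the first keeps words that themselves contain a slash intact, which is a design choice of this sketch and not something the ticket specifies.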
> I’d like to focus on NameFinder in this ticket. (I can add this option to other tools (chunker, classifier, etc.) in another ticket, if needed.)



--
This message was sent by Atlassian Jira
(v8.3.4#803005)