Posted to issues@opennlp.apache.org by "Anthony Beylerian (JIRA)" <ji...@apache.org> on 2015/08/12 21:37:45 UTC

[jira] [Comment Edited] (OPENNLP-801) WSD should not include pre-processing

    [ https://issues.apache.org/jira/browse/OPENNLP-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14693659#comment-14693659 ] 

Anthony Beylerian edited comment on OPENNLP-801 at 8/12/15 7:37 PM:
--------------------------------------------------------------------

Understood. In that case we would need to pass in the required models as input parameters.
The issue is that we would still need the Tokenizer, POSTagger and Lemmatizer during processing.

However, I can modify the code to accept the models for those components, both through the parameters class and through the CLI.

We would therefore hold the following in WSDParameters (or in a new class, WSDUtils):
- Tokenizer
- POSTagger
- Lemmatizer
- etc. 

These would then have to be set explicitly, for example:
LeskParameters params = new LeskParameters();
params.setTokenizerModel("en-token.bin");
etc.

or through static access, such as:
- WSDUtils.loadTokenizer(tokenizerModelFilePath);
- WSDUtils.getTokenizer().tokenize(phrase);
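
To make the two options concrete, here is a rough sketch of both. Everything in it is hypothetical: SketchTokenizer, SketchWsdParameters and SketchWsdUtils are stand-in names and not the actual OpenNLP or WSD classes, and the tokenizer is passed in as an object rather than loaded from a model file path, purely to keep the example self-contained.

```java
import java.util.Arrays;

// Hypothetical stand-in for a tokenizer; NOT opennlp.tools.tokenize.Tokenizer.
interface SketchTokenizer {
    String[] tokenize(String text);
}

// Option 1 (assumed shape): the caller sets each pre-processing
// component explicitly on a parameters object.
class SketchWsdParameters {
    private SketchTokenizer tokenizer;

    void setTokenizer(SketchTokenizer tokenizer) {
        this.tokenizer = tokenizer;
    }

    SketchTokenizer getTokenizer() {
        if (tokenizer == null) {
            throw new IllegalStateException("No tokenizer has been set");
        }
        return tokenizer;
    }
}

// Option 2 (assumed shape): one shared instance behind static accessors.
final class SketchWsdUtils {
    private static SketchTokenizer tokenizer;

    private SketchWsdUtils() {
    }

    // The proposal takes a model file path here; this sketch takes a
    // ready-made tokenizer instead so that it needs no model file.
    static void loadTokenizer(SketchTokenizer loaded) {
        tokenizer = loaded;
    }

    static SketchTokenizer getTokenizer() {
        if (tokenizer == null) {
            throw new IllegalStateException("Call loadTokenizer first");
        }
        return tokenizer;
    }
}

class WsdSetupDemo {
    public static void main(String[] args) {
        // Whitespace splitting stands in for a real, model-backed tokenizer.
        SketchTokenizer whitespace = text -> text.trim().split("\\s+");

        SketchWsdParameters params = new SketchWsdParameters();
        params.setTokenizer(whitespace);
        System.out.println(Arrays.toString(
                params.getTokenizer().tokenize("I went to the bank")));

        SketchWsdUtils.loadTokenizer(whitespace);
        System.out.println(Arrays.toString(
                SketchWsdUtils.getTokenizer().tokenize("the river bank")));
    }
}
```

The parameters object keeps the configuration local to each disambiguator instance, whereas the static variant is shorter to call but shares one tokenizer across all callers.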


The CLI input would then become something like:
opennlp Disambiguator -type Lesk -variant basic -encoding utf-8 -tokenmodel en-token.bin -taggermodel en-tag.bin -lemmodel en-lemma.dict < sentences
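
As an illustration only, the flags above could be read into a map and checked for the required models as sketched below. This is plain Java and does not use OpenNLP's own command-line framework; the class WsdCliArgsSketch and its methods are hypothetical, and the flag names are simply the ones proposed in this comment.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class WsdCliArgsSketch {

    private WsdCliArgsSketch() {
    }

    // Parses "-flag value" pairs into an ordered map keyed by the flag
    // name without its leading dash.
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 0; i < args.length; i += 2) {
            String flag = args[i];
            if (!flag.startsWith("-") || i + 1 >= args.length) {
                throw new IllegalArgumentException(
                        "Expected '-flag value' at: " + flag);
            }
            options.put(flag.substring(1), args[i + 1]);
        }
        return options;
    }

    // Fails fast when a required option (e.g. a model path) is missing.
    static String require(Map<String, String> options, String name) {
        String value = options.get(name);
        if (value == null) {
            throw new IllegalArgumentException("Missing required option -" + name);
        }
        return value;
    }

    public static void main(String[] args) {
        String[] example = {
            "-type", "Lesk", "-variant", "basic", "-encoding", "utf-8",
            "-tokenmodel", "en-token.bin", "-taggermodel", "en-tag.bin",
            "-lemmodel", "en-lemma.dict"
        };
        Map<String, String> options = parse(example);
        System.out.println("tokenizer model: " + require(options, "tokenmodel"));
        System.out.println("tagger model:    " + require(options, "taggermodel"));
        System.out.println("lemmatizer dict: " + require(options, "lemmodel"));
    }
}
```
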

Is that acceptable?



was (Author: beylerian):
Understood, we would need to pass in the required models as input parameters in this case.
I can modify the code to accept those as part of the parameters as well as in the CLI.

We would therefore hold the following in WSDParameters:
- Tokenizer
- POSTagger
- Lemmatizer
- etc. 

For example:
LeskParameters params = new LeskParameters();
params.setTokenizerModel("en-token.bin");
etc.

The CLI input would then become something like:
opennlp Disambiguator -type Lesk -variant basic -encoding utf-8 -tokenmodel en-token.bin -taggermodel en-tag.bin -lemmodel en-lemma.dict < sentences

Is that acceptable?


> WSD should not include pre-processing
> -------------------------------------
>
>                 Key: OPENNLP-801
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-801
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: wsd
>            Reporter: Joern Kottmann
>         Attachments: preprocessing.patch
>
>
> The wsd component currently contains pre-processing code. This should be removed, and the component should instead expect already processed input, e.g. tokenized sentences with POS tags and possibly lemmas.


