Posted to issues@opennlp.apache.org by "Anthony Beylerian (JIRA)" <ji...@apache.org> on 2015/08/05 21:06:05 UTC

[jira] [Commented] (OPENNLP-801) WSD should not include pre-processing

    [ https://issues.apache.org/jira/browse/OPENNLP-801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14658704#comment-14658704 ] 

Anthony Beylerian commented on OPENNLP-801:
-------------------------------------------

How the preprocessing can be decoupled differs for each approach.
As mentioned, the input would need to be tokenized sentences with POS tags.

For Lesk in particular:

To reduce the preprocessing further, the tokens will also have to be filtered by relevance, following the
[relevantPOS] list defined in [opennlp.tools.disambiguator.Constants.java].

In addition, we check for stop words defined in the [stopWords] list, also in [Constants.java].
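
As a rough illustration of the two checks above, here is a minimal sketch of filtering already tokenized and tagged input. The class name, the method name and the contents of both sets are hypothetical stand-ins; the actual [relevantPOS] and [stopWords] lists in [Constants.java] may differ, and the real code may combine the two checks differently.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class TokenFilterSketch {

    // Hypothetical stand-ins for the relevantPOS and stopWords lists;
    // the real contents live in Constants.java and are not reproduced here.
    static final Set<String> RELEVANT_POS =
            new HashSet<>(Arrays.asList("NN", "NNS", "VB", "VBD", "JJ", "RB"));
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "a", "of", "is", "on"));

    // Keeps the lower-cased tokens whose POS tag is relevant and that are not stop words.
    static List<String> filter(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        List<String> kept = new ArrayList<>();
        for (int i = 0; i < tokens.length; i++) {
            String lower = tokens[i].toLowerCase();
            if (RELEVANT_POS.contains(tags[i]) && !STOP_WORDS.contains(lower)) {
                kept.add(lower);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] tokens = {"The", "bank", "is", "near", "the", "river"};
        String[] tags = {"DT", "NN", "VBZ", "IN", "DT", "NN"};
        System.out.println(filter(tokens, tags)); // prints [bank, river]
    }
}
```

If this filtering moves out of the wsd component, the caller would be expected to pass in only the tokens that survive it.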

Also, for the moment, Lesk checks for overlaps using stems instead of lemmas to get larger coverage.
Using lemmas still needs some testing to confirm proper coverage.

In any case, some similar processing does have to happen on the fly, particularly when going through each gloss to find overlaps.
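
To make the on-the-fly part concrete, here is a small sketch of a stem-based overlap count between a context and one gloss. Everything in it is an assumption for illustration: the class and method names are hypothetical, and the suffix stripper is a crude stand-in for a real stemmer, not the one the wsd component uses.

```java
import java.util.HashSet;
import java.util.Set;

public class GlossOverlapSketch {

    // Crude suffix stripper used only as a placeholder for a real stemmer.
    static String stem(String word) {
        String w = word.toLowerCase();
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (w.endsWith(suffix) && w.length() - suffix.length() >= 3) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    // Collects the distinct stems of the given words.
    static Set<String> stems(String[] words) {
        Set<String> result = new HashSet<>();
        for (String word : words) {
            result.add(stem(word));
        }
        return result;
    }

    // Counts the distinct stems shared by the context words and the gloss words.
    static int overlap(String[] context, String[] gloss) {
        Set<String> shared = stems(context);
        shared.retainAll(stems(gloss));
        return shared.size();
    }

    public static void main(String[] args) {
        String[] context = {"deposited", "money", "banks"};
        String[] gloss = {"institution", "accepting", "deposits", "money"};
        System.out.println(overlap(context, gloss)); // prints 2
    }
}
```

Here "deposited" and "deposits" match only because both reduce to the same stem, which is the coverage gain mentioned above. The gloss words still have to be stemmed inside the component, since they do not come from the caller's input.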

However, at the very least, I will start by decoupling the first part.

> WSD should not include pre-processing
> -------------------------------------
>
>                 Key: OPENNLP-801
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-801
>             Project: OpenNLP
>          Issue Type: Improvement
>          Components: wsd
>            Reporter: Joern Kottmann
>
> The wsd component currently contains pre-processing code. This should be removed and it should instead expect already processed inputs. E.g. tokenized sentences with pos tags and maybe lemmas.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)