Posted to issues@opennlp.apache.org by "Joern Kottmann (JIRA)" <ji...@apache.org> on 2016/10/04 10:22:20 UTC

[jira] [Comment Edited] (OPENNLP-862) BRAT format packages do not handle punctuation correctly when training NER model

    [ https://issues.apache.org/jira/browse/OPENNLP-862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15544995#comment-15544995 ] 

Joern Kottmann edited comment on OPENNLP-862 at 10/4/16 10:22 AM:
------------------------------------------------------------------

OpenNLP has to tokenize its input text; Brat avoids this by letting users mark annotations wherever they like. In the end you will need a tokenizer. The WhitespaceTokenizer has the issue you mentioned; the SimpleTokenizer splits on character-class changes and will probably work better for you.

Anyway, I think it makes sense to add an option that lets the Brat parser assume that annotation boundaries are always also token boundaries. It would be very nice if you could send us a patch adding this option.
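To make the difference concrete, here is a small self-contained sketch (it does not use OpenNLP itself, so the method names and the exact character classes are illustrative, not OpenNLP's actual implementation): plain whitespace tokenization leaves the comma glued to "Athens,", while a character-class split in the spirit of the SimpleTokenizer separates it.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch contrasting whitespace tokenization with a
// character-class split similar in spirit to OpenNLP's SimpleTokenizer.
public class TokenizerSketch {

    // Whitespace tokenization: punctuation stays attached ("Athens,").
    static String[] whitespaceTokenize(String text) {
        return text.trim().split("\\s+");
    }

    // Character-class tokenization: start a new token whenever the
    // character class (letter / digit / whitespace / other) changes.
    static String[] charClassTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        int prevClass = -1;
        for (char c : text.toCharArray()) {
            int cls = Character.isLetter(c) ? 0
                    : Character.isDigit(c) ? 1
                    : Character.isWhitespace(c) ? 2 : 3;
            if (cls != prevClass && current.length() > 0) {
                tokens.add(current.toString());
                current.setLength(0);
            }
            if (cls != 2) {  // whitespace separates tokens but is dropped
                current.append(c);
            }
            prevClass = cls;
        }
        if (current.length() > 0) {
            tokens.add(current.toString());
        }
        return tokens.toArray(new String[0]);
    }

    public static void main(String[] args) {
        String line = "Residence:   Athens, Georgia";
        System.out.println(String.join("|", whitespaceTokenize(line)));
        // Residence:|Athens,|Georgia
        System.out.println(String.join("|", charClassTokenize(line)));
        // Residence|:|Athens|,|Georgia
    }
}
```

With the whitespace split, the token "Athens," never matches the Brat annotation span "Athens"; with the character-class split, "Athens" and "," become separate tokens and the annotation boundary coincides with a token boundary.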



> BRAT format packages do not handle punctuation correctly when training NER model
> --------------------------------------------------------------------------------
>
>                 Key: OPENNLP-862
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-862
>             Project: OpenNLP
>          Issue Type: Bug
>          Components: Formats
>    Affects Versions: 1.6.0
>            Reporter: Gregory Werner
>
> BRAT does not require preprocessing of text files in order to add annotations to them, which is great because I can feed documents from the corpora I am given directly into BRAT. Suppose I have a line such as:
> Residence:   Athens, Georgia
> I would provide two annotations in BRAT, Athens and Georgia; BRAT would generate the offsets and everything would be fine.
> It appears, though, that with TokenNameFinderTrainer.brat only one entity, Georgia, is processed correctly (the other is dropped), because the comma is not separated from Athens. I have 789 annotated raw, non-preprocessed text documents from past efforts, and I believe the BRAT format code in OpenNLP should be able to handle lines like the above.
> It appears that BratNameSampleStream uses the WhitespaceTokenizer, which is what produces "Athens," as a token. In my limited testing on raw documents, the SimpleTokenizer performs better with BRAT if the current general approach is kept.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)