Posted to issues@opennlp.apache.org by "Gregory Werner (JIRA)" <ji...@apache.org> on 2016/09/28 13:32:20 UTC

[jira] [Created] (OPENNLP-862) BRAT format packages do not handle punctuation correctly when training NER model

Gregory Werner created OPENNLP-862:
--------------------------------------

             Summary: BRAT format packages do not handle punctuation correctly when training NER model
                 Key: OPENNLP-862
                 URL: https://issues.apache.org/jira/browse/OPENNLP-862
             Project: OpenNLP
          Issue Type: Bug
          Components: Formats
    Affects Versions: 1.6.0
            Reporter: Gregory Werner


BRAT does not require text files to be preprocessed before annotations are added. This is great, because I can feed documents from the corpora I am given directly into BRAT. Suppose I have a line such as:

Residence:   Athens, Georgia

I would create two annotations in BRAT, one for Athens and one for Georgia. BRAT would generate the offsets, and everything would be fine.

In OpenNLP, however, TokenNameFinderTrainer.brat appears to process only one of the two entities correctly: Georgia is kept, while Athens is dropped because the comma is not separated from it. I have 789 annotated raw text documents from past efforts, none of them preprocessed. I believe the BRAT format code in OpenNLP should be able to handle lines like the one above.
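
To illustrate what I suspect is happening, here is a minimal sketch in plain Python. It is not OpenNLP code. It assumes that the text is split on whitespace only and that an annotation is kept only when both of its offsets fall on token boundaries; the helper names are made up for this example.

```python
import re

def whitespace_token_spans(text):
    """Return (start, end) character spans of whitespace-separated tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def aligned(annotation, token_spans):
    """True if the annotation starts and ends exactly on token boundaries."""
    start, end = annotation
    starts = {s for s, _ in token_spans}
    ends = {e for _, e in token_spans}
    return start in starts and end in ends

line = "Residence:   Athens, Georgia"

# Character offsets as an annotation tool would record them for the two entities.
annotations = {
    name: (line.index(name), line.index(name) + len(name))
    for name in ("Athens", "Georgia")
}

tokens = whitespace_token_spans(line)
results = {name: aligned(span, tokens) for name, span in annotations.items()}

for name, ok in results.items():
    print(name, annotations[name], "aligned" if ok else "not aligned")
```

Under these assumptions the token containing Athens is "Athens," with the comma attached, so the end offset of the Athens annotation falls inside a token, while the Georgia annotation lines up exactly. Separating punctuation from words before matching offsets would let both annotations align.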



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)