
Posted to issues@opennlp.apache.org by "William Colen (JIRA)" <ji...@apache.org> on 2017/01/02 20:10:58 UTC

[jira] [Commented] (OPENNLP-743) The chunker training data format is incorrectly/insufficiently described.

    [ https://issues.apache.org/jira/browse/OPENNLP-743?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15793422#comment-15793422 ] 

William Colen commented on OPENNLP-743:
---------------------------------------

The documentation is indeed unclear. I checked the code, and it only works when the columns are separated by a single space.
I also looked at CoNLL-2000: its documentation likewise says "three columns separated by spaces", but the provided data uses exactly one space between columns.
IMO we should fix the documentation and keep the code as is.
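
To illustrate why only a single space works, here is a minimal sketch of what splitting a line on " " does in Java. It is not the OpenNLP parsing code; it only assumes, as the issue suggests, that each line is split with split(" "). The class name, both helper methods, and the sample line "He PRP B-NP" are hypothetical and used for illustration only.

```java
import java.util.Arrays;

public class ChunkLineSplitDemo {

    // Hypothetical helper: splits one training line on a single space,
    // mirroring the split(" ") behavior the issue describes (an assumption,
    // not a copy of the OpenNLP code).
    static String[] splitOnSingleSpace(String line) {
        return line.split(" ");
    }

    // Hypothetical check: a line is usable only if the split yields exactly
    // three non-empty columns (word, POS tag, chunk tag).
    static boolean hasThreeColumns(String line) {
        String[] parts = splitOnSingleSpace(line);
        if (parts.length != 3) {
            return false;
        }
        for (String part : parts) {
            if (part.isEmpty()) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        String[] samples = {
            "He PRP B-NP",       // one space between columns
            "He   PRP   B-NP",   // several spaces: the split yields empty strings
            "He\tPRP\tB-NP"      // tabs: the line is not split at all
        };
        for (String sample : samples) {
            System.out.println(Arrays.toString(splitOnSingleSpace(sample))
                    + " -> three columns: " + hasThreeColumns(sample));
        }
    }
}
```

With several spaces, split(" ") produces empty strings between the columns, and with tabs the whole line comes back as one element, so in both cases the three-column assumption fails.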

> The chunker training data format is incorrectly/insufficiently described.
> -------------------------------------------------------------------------
>
>                 Key: OPENNLP-743
>                 URL: https://issues.apache.org/jira/browse/OPENNLP-743
>             Project: OpenNLP
>          Issue Type: Documentation
>          Components: Chunker
>            Reporter: Zuzana Neverilova
>            Priority: Minor
>              Labels: documentation, easyfix, newbie
>   Original Estimate: 10m
>  Remaining Estimate: 10m
>
> The chunker training data format is described as follows: "The train data consist of three columns separated by spaces. Each word has been put on a separate line and there is an empty line after each sentence." However, in the example, there are several spaces between the tokens and the tags. First, these look like tabs (which are not allowed); second, multiple spaces are not allowed either (apparently, the line String is split with split(" ")). Suggestion: emphasize that columns are separated by exactly one space and that tabs are not allowed.


