Posted to issues@opennlp.apache.org by "Jeff Zemerick (Jira)" <ji...@apache.org> on 2022/03/29 13:18:00 UTC

[jira] [Created] (OPENNLP-1363) Verify the documentation of the lemmatizer input format

Jeff Zemerick created OPENNLP-1363:
--------------------------------------

             Summary: Verify the documentation of the lemmatizer input format
                 Key: OPENNLP-1363
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1363
             Project: OpenNLP
          Issue Type: Task
          Components: Documentation
            Reporter: Jeff Zemerick


In OPENNLP-1257, a change was proposed to split the lemmatizer input by spaces instead of by tabs. I believe tab is the desired delimiter, but we need to make sure the documentation is consistent with the code.

Refer to https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer, in particular the following sentences:

"The training data consist of three columns separated by spaces. Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag and the third its lemma. Here is an example of the file format:"

Determine whether the first sentence of that passage should read "separated by tabs" instead.
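
As a rough way to check which delimiter a given training file actually uses, something like the following sketch could help. It is not OpenNLP code: the class name LemmaFormatSketch, the method malformedLines, and the sample data are all made up for illustration, and it assumes "\n" line endings. It reports every non-empty line that does not split into exactly three columns under a given delimiter.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: reports lines of a lemmatizer
// training file that do not split into exactly three columns
// (word, part-of-speech tag, lemma) under a given delimiter.
class LemmaFormatSketch {

    // Returns the 1-based numbers of non-empty lines whose column count,
    // under the given delimiter regex, is not exactly three.
    // Assumes lines are separated by "\n".
    static List<Integer> malformedLines(String data, String delimiterRegex) {
        List<Integer> bad = new ArrayList<>();
        String[] lines = data.split("\n", -1);
        for (int i = 0; i < lines.length; i++) {
            String line = lines[i];
            if (line.isEmpty()) {
                continue; // an empty line marks a sentence boundary
            }
            // Limit -1 keeps trailing empty columns so they are counted too.
            if (line.split(delimiterRegex, -1).length != 3) {
                bad.add(i + 1);
            }
        }
        return bad;
    }
}
```

For example, with the made-up tab-separated input "He\tPRP\the\nreckons\tVBZ\treckon\n\n", calling malformedLines with "\t" returns an empty list, while calling it with " " flags lines 1 and 2. Note that a simple column count cannot tell the two formats apart when tokens themselves contain spaces, so this is only a first sanity check.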

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)