Posted to issues@opennlp.apache.org by "Jeff Zemerick (Jira)" <ji...@apache.org> on 2022/03/29 13:18:00 UTC
[jira] [Created] (OPENNLP-1363) Verify the documentation of the lemmatizer input format
Jeff Zemerick created OPENNLP-1363:
--------------------------------------
Summary: Verify the documentation of the lemmatizer input format
Key: OPENNLP-1363
URL: https://issues.apache.org/jira/browse/OPENNLP-1363
Project: OpenNLP
Issue Type: Task
Components: Documentation
Reporter: Jeff Zemerick
In OPENNLP-1257, a change was proposed to update the code to split the lemmatizer input by spaces instead of by tabs. I believe tab is the desired delimiter, but we need to make sure the documentation is consistent with the code.
Refer to https://opennlp.apache.org/docs/1.9.4/manual/opennlp.html#tools.lemmatizer, in particular the following sentences:
"The training data consist of three columns separated by spaces. Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag and the third its lemma. Here is an example of the file format:"
Determine whether the first sentence should read "separated by tabs" instead.
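To show why the wording matters, here is a minimal sketch of a check that reports which lines of a training file do not have three columns under a given delimiter. It is only an illustration: the helper names (count_columns, malformed_lines) and the sample data are made up for this issue and are not part of OpenNLP, and it does not reproduce how the OpenNLP code actually parses its input.

```python
def count_columns(line: str, delimiter: str) -> int:
    """Number of fields in one training line under the given delimiter."""
    return len(line.rstrip("\n").split(delimiter))


def malformed_lines(text: str, delimiter: str = "\t", expected: int = 3):
    """Return 1-based numbers of non-empty lines lacking `expected` columns.

    Empty lines are skipped because they mark sentence boundaries.
    """
    bad = []
    for number, line in enumerate(text.splitlines(), start=1):
        if line.strip() == "":
            continue
        if count_columns(line, delimiter) != expected:
            bad.append(number)
    return bad


# Made-up sample (assumption): word, POS tag, lemma, separated by tabs,
# with an empty line between sentences.
sample = "He\tPRP\the\nreckons\tVBZ\treckon\n\nStop\tVB\tstop\n"

print(malformed_lines(sample, "\t"))  # read as tab-delimited
print(malformed_lines(sample, " "))   # read as space-delimited
```

A file written with tabs passes the tab check but fails the space check on every data line, so a reader following "separated by spaces" literally could produce training data that a tab-based parser rejects.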
--
This message was sent by Atlassian Jira
(v8.20.1#820001)