Posted to issues@opennlp.apache.org by "Jeff Zemerick (Jira)" <ji...@apache.org> on 2022/09/17 18:32:00 UTC

[jira] [Created] (OPENNLP-1385) Fix discrepancy in tokenizer documentation

Jeff Zemerick created OPENNLP-1385:
--------------------------------------

Summary: Fix discrepancy in tokenizer documentation
Key: OPENNLP-1385
URL: https://issues.apache.org/jira/browse/OPENNLP-1385
Project: OpenNLP
Issue Type: Task
Components: Documentation, Tokenizer
Reporter: Jeff Zemerick

The tokenizer section of the user guide lists a -cutoff option in the tool's usage:
-cutoff num
minimal number of times a feature must be seen, ignored if -params is used.
However, the usage printed by the CLI does not include this option:
{quote}Arguments description:
-factory factoryName
A sub-class of TokenizerFactory where to get implementation and resources.
-abbDict path
abbreviation dictionary in XML format.
-alphaNumOpt isAlphaNumOpt
Optimization flag to skip alpha numeric tokens for further tokenization
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
{quote}
The CLI does not recognize cutoff as an option, so the documentation is likely incorrect. The code should be reviewed first to confirm.
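
To make the comparison repeatable, here is a minimal sketch of how the documented options could be checked against the CLI usage. It is not part of OpenNLP: the helper names (parse_options, missing_from_cli) are hypothetical, DOC_EXCERPT holds only the documentation lines quoted in this issue, and CLI_USAGE is the usage text pasted above. It assumes each option starts a line with a dash followed by its name.

```python
import re

# Documentation excerpt quoted in this issue (not the full user guide).
DOC_EXCERPT = """\
-cutoff num
minimal number of times a feature must be seen, ignored if -params is used.
"""

# CLI usage text as pasted in this issue.
CLI_USAGE = """\
-factory factoryName
A sub-class of TokenizerFactory where to get implementation and resources.
-abbDict path
abbreviation dictionary in XML format.
-alphaNumOpt isAlphaNumOpt
Optimization flag to skip alpha numeric tokens for further tokenization
-params paramsFile
training parameters file.
-lang language
language which is being processed.
-model modelFile
output model file.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
encoding for reading and writing text, if absent the system default is used.
"""

# Assumption: an option is a line that begins with "-" followed by its name.
# Dashes inside description lines (e.g. "ignored if -params is used") are skipped
# because the pattern is anchored to the start of a line.
OPTION_LINE = re.compile(r"^[ \t]*-(\w+)\b", re.MULTILINE)


def parse_options(usage_text):
    """Return the set of option names found at the start of lines."""
    return set(OPTION_LINE.findall(usage_text))


def missing_from_cli(doc_text, cli_text):
    """Return documented option names that the CLI usage does not list, sorted."""
    return sorted(parse_options(doc_text) - parse_options(cli_text))


if __name__ == "__main__":
    for name in missing_from_cli(DOC_EXCERPT, CLI_USAGE):
        print(f"documented but not in CLI usage: -{name}")
```

Running the same check with the full usage section from the user guide in place of DOC_EXCERPT would show whether cutoff is the only mismatch.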
