Posted to issues@opennlp.apache.org by "Jeff Zemerick (Jira)" <ji...@apache.org> on 2022/09/17 18:32:00 UTC

[jira] [Created] (OPENNLP-1385) Fix discrepancy in tokenizer documentation

Jeff Zemerick created OPENNLP-1385:
--------------------------------------

             Summary: Fix discrepancy in tokenizer documentation
                 Key: OPENNLP-1385
                 URL: https://issues.apache.org/jira/browse/OPENNLP-1385
             Project: OpenNLP
          Issue Type: Task
          Components: Documentation, Tokenizer
            Reporter: Jeff Zemerick


The tokenizer section of the user guide shows a -cutoff option in the tool's usage:
        -cutoff num
                minimal number of times a feature must be seen, ignored if -params is used.
However, this option does not appear in the usage printed when running the CLI:
{quote}Arguments description:
        -factory factoryName
                A sub-class of TokenizerFactory where to get implementation and resources.
        -abbDict path
                abbreviation dictionary in XML format.
        -alphaNumOpt isAlphaNumOpt
                Optimization flag to skip alpha numeric tokens for further tokenization
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -model modelFile
                output model file.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
{quote}
The CLI does not recognize -cutoff as an option, so the documentation is most likely incorrect. The code should be reviewed first to confirm this before the documentation is changed.
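
As a quick cross-check, the flag names in the two usage listings can be compared mechanically. The following is only a rough sketch in Python, not part of OpenNLP: the helper name option_names is hypothetical, the documented listing is an abbreviated excerpt containing just the options mentioned in this issue, and the CLI listing is the one quoted above.

```python
import re

def option_names(usage_text):
    """Return the set of flags (for example '-cutoff') that begin a line of usage text."""
    return set(re.findall(r"^\s*(-[A-Za-z]\w*)\b", usage_text, flags=re.MULTILINE))

# Abbreviated excerpt of the user guide usage (assumption: only the
# options mentioned in this issue are reproduced here).
documented = """
        -cutoff num
                minimal number of times a feature must be seen, ignored if -params is used.
        -params paramsFile
                training parameters file.
"""

# Usage as printed by the CLI, copied from the listing quoted in this issue.
cli = """
        -factory factoryName
                A sub-class of TokenizerFactory where to get implementation and resources.
        -abbDict path
                abbreviation dictionary in XML format.
        -alphaNumOpt isAlphaNumOpt
                Optimization flag to skip alpha numeric tokens for further tokenization
        -params paramsFile
                training parameters file.
        -lang language
                language which is being processed.
        -model modelFile
                output model file.
        -data sampleData
                data to be used, usually a file name.
        -encoding charsetName
                encoding for reading and writing text, if absent the system default is used.
"""

# Flags that the documentation lists but the CLI usage does not.
only_in_docs = sorted(option_names(documented) - option_names(cli))
print("documented but not in CLI usage:", only_in_docs)
```

Pasting the full usage text from the user guide into documented would give a complete comparison. This only compares the printed listings; it does not replace checking the code itself.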



--
This message was sent by Atlassian Jira
(v8.20.10#820010)