Posted to commits@opennlp.apache.org by co...@apache.org on 2017/05/09 16:13:25 UTC
opennlp git commit: OPENNLP-1052: Update README and CLI docbook before release
Repository: opennlp
Updated Branches:
refs/heads/master 3ab6698b6 -> db9c511e8
OPENNLP-1052: Update README and CLI docbook before release
closes apache/opennlp#195
Project: http://git-wip-us.apache.org/repos/asf/opennlp/repo
Commit: http://git-wip-us.apache.org/repos/asf/opennlp/commit/db9c511e
Tree: http://git-wip-us.apache.org/repos/asf/opennlp/tree/db9c511e
Diff: http://git-wip-us.apache.org/repos/asf/opennlp/diff/db9c511e
Branch: refs/heads/master
Commit: db9c511e8d5c3665eb2bb31cf0b11c0302252d45
Parents: 3ab6698
Author: William D C M SILVA <co...@apache.org>
Authored: Tue May 9 13:09:46 2017 -0300
Committer: William D C M SILVA <co...@apache.org>
Committed: Tue May 9 13:09:46 2017 -0300
----------------------------------------------------------------------
opennlp-distr/README | 29 +-
opennlp-docs/src/docbkx/cli.xml | 582 +++++++++++++++++++++--------------
2 files changed, 364 insertions(+), 247 deletions(-)
----------------------------------------------------------------------
http://git-wip-us.apache.org/repos/asf/opennlp/blob/db9c511e/opennlp-distr/README
----------------------------------------------------------------------
diff --git a/opennlp-distr/README b/opennlp-distr/README
index 12dc8ec..975c651 100644
--- a/opennlp-distr/README
+++ b/opennlp-distr/README
@@ -19,22 +19,25 @@ What is new in Apache OpenNLP ${pom.version}
---------------------------------------
This release introduces many new features, improvements and bug fixes. The API
-has been improved for a better consistency and 1.4 deprecated methods were
-removed. Now Java 1.8 is required.
+has been improved for a better consistency and many deprecated methods were
+removed. Java 1.8 is required.
Additionally the release contains the following noteworthy changes:
-- Name Finder evaluation can now show a confusion matrix
-- The default evaluation output contains more details
-- Added a Language Model CLI tool
-- Add Moses format support
-- More refactoring and cleanup, specially in Machine Learning package and Dictionary
-- Removed deprecated trainers from UIMA integration
-- Fixed potential localization issues and added maven plugin to prevent it (ForbiddenAPI)
-- Fixed issues with the BRAT corpus reader
-- Deprecated GIS class, will be removed in a future 1.8.x release
+- POS Tagger context generator now supports feature generation XML
+- Add a Name Finder feature generator that adds POS Tag features
+- Add CONLL-U format support
+- Improve default Name Finder settings
+- TokenNameFinderEvaluator CLI now support nameTypes argument
+- Stupid backoff is now the default in NGramLanguageModel
+- Language codes now are ISO 639-3 compliant
+- Add many unit tests
+- Distribution package now includes example parameters file
+- Now prefix and suffix feature generators are configurable
+- Remove API in Document Categorizer for user specified tokenizer
+- Learnable lemmatizer now returns all possible lemmas for a given word and pos tag
+- Add stemmer, detokenizer and sentence detection abbreviations for Irish
+- Chunker SequenceValidator signature changed to allow access to both token and POS tag
A detailed list of the issues related to this release can be found in the release
notes.
-
-
http://git-wip-us.apache.org/repos/asf/opennlp/blob/db9c511e/opennlp-docs/src/docbkx/cli.xml
----------------------------------------------------------------------
diff --git a/opennlp-docs/src/docbkx/cli.xml b/opennlp-docs/src/docbkx/cli.xml
index 3dc66b7..1a8c326 100644
--- a/opennlp-docs/src/docbkx/cli.xml
+++ b/opennlp-docs/src/docbkx/cli.xml
@@ -42,7 +42,7 @@ under the License.
<title>Doccat</title>
-<para>Learnable document categorizer</para>
+<para>Learned document categorizer</para>
<screen>
<![CDATA[
@@ -60,15 +60,15 @@ Usage: opennlp Doccat model < documents
<screen>
<![CDATA[
-Usage: opennlp DoccatTrainer[.leipzig] [-factory factoryName] [-tokenizer tokenizer] [-featureGenerators fg]
+Usage: opennlp DoccatTrainer[.leipzig] [-factory factoryName] [-featureGenerators fg] [-tokenizer tokenizer]
[-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of DoccatFactory where to get implementation and resources.
- -tokenizer tokenizer
- Tokenizer implementation. WhitespaceTokenizer is used if not specified.
-featureGenerators fg
Comma separated feature generator classes. Bag of words is used if not specified.
+ -tokenizer tokenizer
+ Tokenizer implementation. WhitespaceTokenizer is used if not specified.
-params paramsFile
training parameters file.
-lang language
@@ -113,13 +113,13 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp DoccatEvaluator[.leipzig] [-misclassified true|false] -model model [-reportOutputFile
+Usage: opennlp DoccatEvaluator[.leipzig] -model model [-misclassified true|false] [-reportOutputFile
outputFile] -data sampleData [-encoding charsetName]
Arguments description:
- -misclassified true|false
- if true will print false negatives and false positives.
-model model
the model file to be evaluated.
+ -misclassified true|false
+ if true will print false negatives and false positives.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
@@ -160,20 +160,20 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp DoccatCrossValidator[.leipzig] [-folds num] [-misclassified true|false] [-factory factoryName]
- [-tokenizer tokenizer] [-featureGenerators fg] [-params paramsFile] -lang language [-reportOutputFile
+Usage: opennlp DoccatCrossValidator[.leipzig] [-misclassified true|false] [-folds num] [-factory factoryName]
+ [-featureGenerators fg] [-tokenizer tokenizer] [-params paramsFile] -lang language [-reportOutputFile
outputFile] -data sampleData [-encoding charsetName]
Arguments description:
- -folds num
- number of folds, default is 10.
-misclassified true|false
if true will print false negatives and false positives.
+ -folds num
+ number of folds, default is 10.
-factory factoryName
A sub-class of DoccatFactory where to get implementation and resources.
- -tokenizer tokenizer
- Tokenizer implementation. WhitespaceTokenizer is used if not specified.
-featureGenerators fg
Comma separated feature generator classes. Bag of words is used if not specified.
+ -tokenizer tokenizer
+ Tokenizer implementation. WhitespaceTokenizer is used if not specified.
-params paramsFile
training parameters file.
-lang language
@@ -351,18 +351,18 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -463,13 +463,13 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp TokenizerMEEvaluator[.ad|.pos|.conllx|.namefinder|.parse] [-misclassified true|false] -model
- model -data sampleData [-encoding charsetName]
+Usage: opennlp TokenizerMEEvaluator[.ad|.pos|.conllx|.namefinder|.parse] -model model [-misclassified
+ true|false] -data sampleData [-encoding charsetName]
Arguments description:
- -misclassified true|false
- if true will print false negatives and false positives.
-model model
the model file to be evaluated.
+ -misclassified true|false
+ if true will print false negatives and false positives.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
@@ -490,18 +490,18 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -602,14 +602,14 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp TokenizerCrossValidator[.ad|.pos|.conllx|.namefinder|.parse] [-folds num] [-misclassified
- true|false] [-factory factoryName] [-abbDict path] [-alphaNumOpt isAlphaNumOpt] [-params paramsFile]
+Usage: opennlp TokenizerCrossValidator[.ad|.pos|.conllx|.namefinder|.parse] [-misclassified true|false]
+ [-folds num] [-factory factoryName] [-abbDict path] [-alphaNumOpt isAlphaNumOpt] [-params paramsFile]
-lang language -data sampleData [-encoding charsetName]
Arguments description:
- -folds num
- number of folds, default is 10.
-misclassified true|false
if true will print false negatives and false positives.
+ -folds num
+ number of folds, default is 10.
-factory factoryName
A sub-class of TokenizerFactory where to get implementation and resources.
-abbDict path
@@ -640,18 +640,18 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -769,18 +769,18 @@ Usage: opennlp TokenizerConverter help|ad|pos|conllx|namefinder|parse [help|opti
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -916,15 +916,15 @@ Usage: opennlp SentenceDetector model < sentences
<screen>
<![CDATA[
Usage: opennlp SentenceDetectorTrainer[.ad|.pos|.conllx|.namefinder|.parse|.moses|.letsmt] [-factory
- factoryName] [-eosChars string] [-abbDict path] [-params paramsFile] -lang language -model modelFile
+ factoryName] [-abbDict path] [-eosChars string] [-params paramsFile] -lang language -model modelFile
-data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of SentenceDetectorFactory where to get implementation and resources.
- -eosChars string
- EOS characters.
-abbDict path
abbreviation dictionary in XML format.
+ -eosChars string
+ EOS characters.
-params paramsFile
training parameters file.
-lang language
@@ -951,18 +951,18 @@ Arguments description:
<entry>Encoding for reading and writing text.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>includeTitles</entry>
<entry>includeTitles</entry>
<entry>Yes</entry>
<entry>If true will include sentences marked as headlines.</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -1089,13 +1089,13 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp SentenceDetectorEvaluator[.ad|.pos|.conllx|.namefinder|.parse|.moses|.letsmt] [-misclassified
- true|false] -model model -data sampleData [-encoding charsetName]
+Usage: opennlp SentenceDetectorEvaluator[.ad|.pos|.conllx|.namefinder|.parse|.moses|.letsmt] -model model
+ [-misclassified true|false] -data sampleData [-encoding charsetName]
Arguments description:
- -misclassified true|false
- if true will print false negatives and false positives.
-model model
the model file to be evaluated.
+ -misclassified true|false
+ if true will print false negatives and false positives.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
@@ -1116,18 +1116,18 @@ Arguments description:
<entry>Encoding for reading and writing text.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>includeTitles</entry>
<entry>includeTitles</entry>
<entry>Yes</entry>
<entry>If true will include sentences marked as headlines.</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -1255,23 +1255,23 @@ Arguments description:
<screen>
<![CDATA[
Usage: opennlp SentenceDetectorCrossValidator[.ad|.pos|.conllx|.namefinder|.parse|.moses|.letsmt] [-factory
- factoryName] [-eosChars string] [-abbDict path] [-params paramsFile] -lang language [-folds num]
- [-misclassified true|false] -data sampleData [-encoding charsetName]
+ factoryName] [-abbDict path] [-eosChars string] [-params paramsFile] -lang language [-misclassified
+ true|false] [-folds num] -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of SentenceDetectorFactory where to get implementation and resources.
- -eosChars string
- EOS characters.
-abbDict path
abbreviation dictionary in XML format.
+ -eosChars string
+ EOS characters.
-params paramsFile
training parameters file.
-lang language
language which is being processed.
- -folds num
- number of folds, default is 10.
-misclassified true|false
if true will print false negatives and false positives.
+ -folds num
+ number of folds, default is 10.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
@@ -1292,18 +1292,18 @@ Arguments description:
<entry>Encoding for reading and writing text.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>includeTitles</entry>
<entry>includeTitles</entry>
<entry>Yes</entry>
<entry>If true will include sentences marked as headlines.</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -1447,18 +1447,18 @@ Usage: opennlp SentenceDetectorConverter help|ad|pos|conllx|namefinder|parse|mos
<entry>Encoding for reading and writing text.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>includeTitles</entry>
<entry>includeTitles</entry>
<entry>Yes</entry>
<entry>If true will include sentences marked as headlines.</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -1642,14 +1642,14 @@ Arguments description:
<tbody>
<row>
<entry morerows='3' valign='middle'>evalita</entry>
-<entry>lang</entry>
-<entry>it</entry>
+<entry>types</entry>
+<entry>per,loc,org,gpe</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,gpe</entry>
+<entry>lang</entry>
+<entry>it</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -1673,18 +1673,18 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -1692,14 +1692,14 @@ Arguments description:
</row>
<row>
<entry morerows='3' valign='middle'>conll03</entry>
-<entry>lang</entry>
-<entry>en|de</entry>
+<entry>types</entry>
+<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,misc</entry>
+<entry>lang</entry>
+<entry>eng|deu</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -1736,14 +1736,14 @@ Arguments description:
</row>
<row>
<entry morerows='3' valign='middle'>conll02</entry>
-<entry>lang</entry>
-<entry>es|nl</entry>
+<entry>types</entry>
+<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,misc</entry>
+<entry>lang</entry>
+<entry>es|nl</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -1836,17 +1836,17 @@ Arguments description:
<screen>
<![CDATA[
Usage: opennlp TokenNameFinderEvaluator[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
- [-nameTypes types] [-misclassified true|false] -model model [-detailedF true|false]
+ [-nameTypes types] -model model [-misclassified true|false] [-detailedF true|false]
[-reportOutputFile outputFile] -data sampleData [-encoding charsetName]
Arguments description:
-nameTypes types
name types to use for evaluation
- -misclassified true|false
- if true will print false negatives and false positives.
-model model
the model file to be evaluated.
+ -misclassified true|false
+ if true will print false negatives and false positives.
-detailedF true|false
- if true will print detailed FMeasure results.
+ if true (default) will print detailed FMeasure results.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
@@ -1863,14 +1863,14 @@ Arguments description:
<tbody>
<row>
<entry morerows='3' valign='middle'>evalita</entry>
-<entry>lang</entry>
-<entry>it</entry>
+<entry>types</entry>
+<entry>per,loc,org,gpe</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,gpe</entry>
+<entry>lang</entry>
+<entry>it</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -1894,18 +1894,18 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -1913,14 +1913,14 @@ Arguments description:
</row>
<row>
<entry morerows='3' valign='middle'>conll03</entry>
-<entry>lang</entry>
-<entry>en|de</entry>
+<entry>types</entry>
+<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,misc</entry>
+<entry>lang</entry>
+<entry>eng|deu</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -1957,14 +1957,14 @@ Arguments description:
</row>
<row>
<entry morerows='3' valign='middle'>conll02</entry>
-<entry>lang</entry>
-<entry>es|nl</entry>
+<entry>types</entry>
+<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,misc</entry>
+<entry>lang</entry>
+<entry>es|nl</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -2059,8 +2059,8 @@ Arguments description:
Usage: opennlp
TokenNameFinderCrossValidator[.evalita|.ad|.conll03|.bionlp2004|.conll02|.muc6|.ontonotes|.brat]
[-factory factoryName] [-resources resourcesDir] [-type modelType] [-featuregen featuregenFile]
- [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language [-folds num]
- [-misclassified true|false] [-detailedF true|false] [-reportOutputFile outputFile] -data sampleData
+ [-nameTypes types] [-sequenceCodec codec] [-params paramsFile] -lang language [-misclassified
+ true|false] [-folds num] [-detailedF true|false] [-reportOutputFile outputFile] -data sampleData
[-encoding charsetName]
Arguments description:
-factory factoryName
@@ -2079,12 +2079,12 @@ Arguments description:
training parameters file.
-lang language
language which is being processed.
- -folds num
- number of folds, default is 10.
-misclassified true|false
if true will print false negatives and false positives.
+ -folds num
+ number of folds, default is 10.
-detailedF true|false
- if true will print detailed FMeasure results.
+ if true (default) will print detailed FMeasure results.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
@@ -2101,14 +2101,14 @@ Arguments description:
<tbody>
<row>
<entry morerows='3' valign='middle'>evalita</entry>
-<entry>lang</entry>
-<entry>it</entry>
+<entry>types</entry>
+<entry>per,loc,org,gpe</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,gpe</entry>
+<entry>lang</entry>
+<entry>it</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -2132,18 +2132,18 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -2151,14 +2151,14 @@ Arguments description:
</row>
<row>
<entry morerows='3' valign='middle'>conll03</entry>
-<entry>lang</entry>
-<entry>en|de</entry>
+<entry>types</entry>
+<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,misc</entry>
+<entry>lang</entry>
+<entry>eng|deu</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -2195,14 +2195,14 @@ Arguments description:
</row>
<row>
<entry morerows='3' valign='middle'>conll02</entry>
-<entry>lang</entry>
-<entry>es|nl</entry>
+<entry>types</entry>
+<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,misc</entry>
+<entry>lang</entry>
+<entry>es|nl</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -2305,14 +2305,14 @@ Usage: opennlp TokenNameFinderConverter help|evalita|ad|conll03|bionlp2004|conll
<tbody>
<row>
<entry morerows='3' valign='middle'>evalita</entry>
-<entry>lang</entry>
-<entry>it</entry>
+<entry>types</entry>
+<entry>per,loc,org,gpe</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,gpe</entry>
+<entry>lang</entry>
+<entry>it</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -2336,18 +2336,18 @@ Usage: opennlp TokenNameFinderConverter help|evalita|ad|conll03|bionlp2004|conll
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>splitHyphenatedTokens</entry>
<entry>split</entry>
<entry>Yes</entry>
<entry>If true all hyphenated tokens will be separated (default true)</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -2355,14 +2355,14 @@ Usage: opennlp TokenNameFinderConverter help|evalita|ad|conll03|bionlp2004|conll
</row>
<row>
<entry morerows='3' valign='middle'>conll03</entry>
-<entry>lang</entry>
-<entry>en|de</entry>
+<entry>types</entry>
+<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,misc</entry>
+<entry>lang</entry>
+<entry>eng|deu</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -2399,14 +2399,14 @@ Usage: opennlp TokenNameFinderConverter help|evalita|ad|conll03|bionlp2004|conll
</row>
<row>
<entry morerows='3' valign='middle'>conll02</entry>
-<entry>lang</entry>
-<entry>es|nl</entry>
+<entry>types</entry>
+<entry>per,loc,org,misc</entry>
<entry>No</entry>
<entry></entry>
</row>
<row>
-<entry>types</entry>
-<entry>per,loc,org,misc</entry>
+<entry>lang</entry>
+<entry>es|nl</entry>
<entry>No</entry>
<entry></entry>
</row>
@@ -2498,13 +2498,13 @@ Usage: opennlp TokenNameFinderConverter help|evalita|ad|conll03|bionlp2004|conll
<screen>
<![CDATA[
-Usage: opennlp CensusDictionaryCreator [-encoding charsetName] [-lang code] -dict dict -censusData censusDict
+Usage: opennlp CensusDictionaryCreator [-encoding charsetName] [-lang code] -censusData censusDict -dict dict
Arguments description:
-encoding charsetName
-lang code
- -dict dict
-censusData censusDict
+ -dict dict
]]>
</screen>
@@ -2538,19 +2538,18 @@ Usage: opennlp POSTagger model < sentences
<screen>
<![CDATA[
-Usage: opennlp POSTaggerTrainer[.ad|.conllx|.parse|.ontonotes] [-factory factoryName] [-type
- maxent|perceptron|perceptron_sequence] [-dict dictionaryPath] [-ngram cutoff] [-tagDictCutoff
- tagDictCutoff] [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding
- charsetName]
+Usage: opennlp POSTaggerTrainer[.ad|.conllx|.parse|.ontonotes|.conllu] [-factory factoryName] [-resources
+ resourcesDir] [-featuregen featuregenFile] [-dict dictionaryPath] [-tagDictCutoff tagDictCutoff]
+ [-params paramsFile] -lang language -model modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
A sub-class of POSTaggerFactory where to get implementation and resources.
- -type maxent|perceptron|perceptron_sequence
- The type of the token name finder model. One of maxent|perceptron|perceptron_sequence.
+ -resources resourcesDir
+ The resources directory
+ -featuregen featuregenFile
+ The feature generator descriptor file
-dict dictionaryPath
The XML tag dictionary file
- -ngram cutoff
- NGram cutoff. If not specified will not create ngram dictionary.
-tagDictCutoff tagDictCutoff
TagDictionary cutoff. If specified will create/expand a mutable TagDictionary
-params paramsFile
@@ -2579,12 +2578,6 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>expandME</entry>
<entry>expandME</entry>
<entry>Yes</entry>
@@ -2597,6 +2590,12 @@ Arguments description:
<entry>Combine POS Tags with word features, like number and gender.</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -2635,6 +2634,25 @@ Arguments description:
<entry>No</entry>
<entry></entry>
</row>
+<row>
+<entry morerows='2' valign='middle'>conllu</entry>
+<entry>tagset</entry>
+<entry>tagset</entry>
+<entry>Yes</entry>
+<entry>U|x u for unified tags and x for language-specific part-of-speech tags</entry>
+</row>
+<row>
+<entry>data</entry>
+<entry>sampleData</entry>
+<entry>No</entry>
+<entry>Data to be used, usually a file name.</entry>
+</row>
+<row>
+<entry>encoding</entry>
+<entry>charsetName</entry>
+<entry>Yes</entry>
+<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
+</row>
</tbody>
</tgroup></informaltable>
@@ -2648,13 +2666,13 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp POSTaggerEvaluator[.ad|.conllx|.parse|.ontonotes] [-misclassified true|false] -model model
- [-reportOutputFile outputFile] -data sampleData [-encoding charsetName]
+Usage: opennlp POSTaggerEvaluator[.ad|.conllx|.parse|.ontonotes|.conllu] -model model [-misclassified
+ true|false] [-reportOutputFile outputFile] -data sampleData [-encoding charsetName]
Arguments description:
- -misclassified true|false
- if true will print false negatives and false positives.
-model model
the model file to be evaluated.
+ -misclassified true|false
+ if true will print false negatives and false positives.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
@@ -2677,12 +2695,6 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>expandME</entry>
<entry>expandME</entry>
<entry>Yes</entry>
@@ -2695,6 +2707,12 @@ Arguments description:
<entry>Combine POS Tags with word features, like number and gender.</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -2733,6 +2751,25 @@ Arguments description:
<entry>No</entry>
<entry></entry>
</row>
+<row>
+<entry morerows='2' valign='middle'>conllu</entry>
+<entry>tagset</entry>
+<entry>tagset</entry>
+<entry>Yes</entry>
+<entry>U|x u for unified tags and x for language-specific part-of-speech tags</entry>
+</row>
+<row>
+<entry>data</entry>
+<entry>sampleData</entry>
+<entry>No</entry>
+<entry>Data to be used, usually a file name.</entry>
+</row>
+<row>
+<entry>encoding</entry>
+<entry>charsetName</entry>
+<entry>Yes</entry>
+<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
+</row>
</tbody>
</tgroup></informaltable>
@@ -2746,23 +2783,23 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp POSTaggerCrossValidator[.ad|.conllx|.parse|.ontonotes] [-folds num] [-misclassified
- true|false] [-factory factoryName] [-type maxent|perceptron|perceptron_sequence] [-dict
- dictionaryPath] [-ngram cutoff] [-tagDictCutoff tagDictCutoff] [-params paramsFile] -lang language
- [-reportOutputFile outputFile] -data sampleData [-encoding charsetName]
+Usage: opennlp POSTaggerCrossValidator[.ad|.conllx|.parse|.ontonotes|.conllu] [-misclassified true|false]
+ [-folds num] [-factory factoryName] [-resources resourcesDir] [-featuregen featuregenFile] [-dict
+ dictionaryPath] [-tagDictCutoff tagDictCutoff] [-params paramsFile] -lang language [-reportOutputFile
+ outputFile] -data sampleData [-encoding charsetName]
Arguments description:
- -folds num
- number of folds, default is 10.
-misclassified true|false
if true will print false negatives and false positives.
+ -folds num
+ number of folds, default is 10.
-factory factoryName
A sub-class of POSTaggerFactory where to get implementation and resources.
- -type maxent|perceptron|perceptron_sequence
- The type of the token name finder model. One of maxent|perceptron|perceptron_sequence.
+ -resources resourcesDir
+ The resources directory
+ -featuregen featuregenFile
+ The feature generator descriptor file
-dict dictionaryPath
The XML tag dictionary file
- -ngram cutoff
- NGram cutoff. If not specified will not create ngram dictionary.
-tagDictCutoff tagDictCutoff
TagDictionary cutoff. If specified will create/expand a mutable TagDictionary
-params paramsFile
@@ -2791,12 +2828,6 @@ Arguments description:
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>expandME</entry>
<entry>expandME</entry>
<entry>Yes</entry>
@@ -2809,6 +2840,12 @@ Arguments description:
<entry>Combine POS Tags with word features, like number and gender.</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -2847,6 +2884,25 @@ Arguments description:
<entry>No</entry>
<entry></entry>
</row>
+<row>
+<entry morerows='2' valign='middle'>conllu</entry>
+<entry>tagset</entry>
+<entry>tagset</entry>
+<entry>Yes</entry>
+<entry>U|x u for unified tags and x for language-specific part-of-speech tags</entry>
+</row>
+<row>
+<entry>data</entry>
+<entry>sampleData</entry>
+<entry>No</entry>
+<entry>Data to be used, usually a file name.</entry>
+</row>
+<row>
+<entry>encoding</entry>
+<entry>charsetName</entry>
+<entry>Yes</entry>
+<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
+</row>
</tbody>
</tgroup></informaltable>
@@ -2856,11 +2912,11 @@ Arguments description:
<title>POSTaggerConverter</title>
-<para>Converts foreign data formats (ad,conllx,parse,ontonotes) to native OpenNLP format</para>
+<para>Converts foreign data formats (ad,conllx,parse,ontonotes,conllu) to native OpenNLP format</para>
<screen>
<![CDATA[
-Usage: opennlp POSTaggerConverter help|ad|conllx|parse|ontonotes [help|options...]
+Usage: opennlp POSTaggerConverter help|ad|conllx|parse|ontonotes|conllu [help|options...]
]]>
</screen>
@@ -2877,12 +2933,6 @@ Usage: opennlp POSTaggerConverter help|ad|conllx|parse|ontonotes [help|options..
<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
</row>
<row>
-<entry>lang</entry>
-<entry>language</entry>
-<entry>No</entry>
-<entry>Language which is being processed.</entry>
-</row>
-<row>
<entry>expandME</entry>
<entry>expandME</entry>
<entry>Yes</entry>
@@ -2895,6 +2945,12 @@ Usage: opennlp POSTaggerConverter help|ad|conllx|parse|ontonotes [help|options..
<entry>Combine POS Tags with word features, like number and gender.</entry>
</row>
<row>
+<entry>lang</entry>
+<entry>language</entry>
+<entry>No</entry>
+<entry>Language which is being processed.</entry>
+</row>
+<row>
<entry>data</entry>
<entry>sampleData</entry>
<entry>No</entry>
@@ -2933,6 +2989,25 @@ Usage: opennlp POSTaggerConverter help|ad|conllx|parse|ontonotes [help|options..
<entry>No</entry>
<entry></entry>
</row>
+<row>
+<entry morerows='2' valign='middle'>conllu</entry>
+<entry>tagset</entry>
+<entry>tagset</entry>
+<entry>Yes</entry>
+<entry>U|x u for unified tags and x for language-specific part-of-speech tags</entry>
+</row>
+<row>
+<entry>data</entry>
+<entry>sampleData</entry>
+<entry>No</entry>
+<entry>Data to be used, usually a file name.</entry>
+</row>
+<row>
+<entry>encoding</entry>
+<entry>charsetName</entry>
+<entry>Yes</entry>
+<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
+</row>
</tbody>
</tgroup></informaltable>
@@ -2966,7 +3041,7 @@ Usage: opennlp LemmatizerME model < sentences
<screen>
<![CDATA[
-Usage: opennlp LemmatizerTrainerME [-factory factoryName] [-params paramsFile] -lang language -model
+Usage: opennlp LemmatizerTrainerME[.conllu] [-factory factoryName] [-params paramsFile] -lang language -model
modelFile -data sampleData [-encoding charsetName]
Arguments description:
-factory factoryName
@@ -2989,6 +3064,25 @@ Arguments description:
<informaltable frame='all'><tgroup cols='4' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
+<row>
+<entry morerows='2' valign='middle'>conllu</entry>
+<entry>tagset</entry>
+<entry>tagset</entry>
+<entry>Yes</entry>
+<entry>U|x u for unified tags and x for language-specific part-of-speech tags</entry>
+</row>
+<row>
+<entry>data</entry>
+<entry>sampleData</entry>
+<entry>No</entry>
+<entry>Data to be used, usually a file name.</entry>
+</row>
+<row>
+<entry>encoding</entry>
+<entry>charsetName</entry>
+<entry>Yes</entry>
+<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
+</row>
</tbody>
</tgroup></informaltable>
@@ -3002,13 +3096,13 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp LemmatizerEvaluator [-misclassified true|false] -model model [-reportOutputFile outputFile]
- -data sampleData [-encoding charsetName]
+Usage: opennlp LemmatizerEvaluator[.conllu] -model model [-misclassified true|false] [-reportOutputFile
+ outputFile] -data sampleData [-encoding charsetName]
Arguments description:
- -misclassified true|false
- if true will print false negatives and false positives.
-model model
the model file to be evaluated.
+ -misclassified true|false
+ if true will print false negatives and false positives.
-reportOutputFile outputFile
the path of the fine-grained report file.
-data sampleData
@@ -3023,6 +3117,25 @@ Arguments description:
<informaltable frame='all'><tgroup cols='4' align='left' colsep='1' rowsep='1'>
<thead><row><entry>Format</entry><entry>Argument</entry><entry>Value</entry><entry>Optional</entry><entry>Description</entry></row></thead>
<tbody>
+<row>
+<entry morerows='2' valign='middle'>conllu</entry>
+<entry>tagset</entry>
+<entry>tagset</entry>
+<entry>Yes</entry>
+<entry>U|x u for unified tags and x for language-specific part-of-speech tags</entry>
+</row>
+<row>
+<entry>data</entry>
+<entry>sampleData</entry>
+<entry>No</entry>
+<entry>Data to be used, usually a file name.</entry>
+</row>
+<row>
+<entry>encoding</entry>
+<entry>charsetName</entry>
+<entry>Yes</entry>
+<entry>Encoding for reading and writing text, if absent the system default is used.</entry>
+</row>
</tbody>
</tgroup></informaltable>
@@ -3123,15 +3236,15 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp ChunkerEvaluator[.ad] [-misclassified true|false] -model model [-detailedF true|false] -data
+Usage: opennlp ChunkerEvaluator[.ad] -model model [-misclassified true|false] [-detailedF true|false] -data
sampleData [-encoding charsetName]
Arguments description:
- -misclassified true|false
- if true will print false negatives and false positives.
-model model
the model file to be evaluated.
+ -misclassified true|false
+ if true will print false negatives and false positives.
-detailedF true|false
- if true will print detailed FMeasure results.
+ if true (default) will print detailed FMeasure results.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
@@ -3188,8 +3301,9 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp ChunkerCrossValidator[.ad] [-factory factoryName] [-params paramsFile] -lang language [-folds
- num] [-misclassified true|false] [-detailedF true|false] -data sampleData [-encoding charsetName]
+Usage: opennlp ChunkerCrossValidator[.ad] [-factory factoryName] [-params paramsFile] -lang language
+ [-misclassified true|false] [-folds num] [-detailedF true|false] -data sampleData [-encoding
+ charsetName]
Arguments description:
-factory factoryName
A sub-class of ChunkerFactory where to get implementation and resources.
@@ -3197,12 +3311,12 @@ Arguments description:
training parameters file.
-lang language
language which is being processed.
- -folds num
- number of folds, default is 10.
-misclassified true|false
if true will print false negatives and false positives.
+ -folds num
+ number of folds, default is 10.
-detailedF true|false
- if true will print detailed FMeasure results.
+ if true (default) will print detailed FMeasure results.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
@@ -3399,13 +3513,13 @@ Arguments description:
<screen>
<![CDATA[
-Usage: opennlp ParserEvaluator[.ontonotes|.frenchtreebank] [-misclassified true|false] -model model -data
+Usage: opennlp ParserEvaluator[.ontonotes|.frenchtreebank] -model model [-misclassified true|false] -data
sampleData [-encoding charsetName]
Arguments description:
- -misclassified true|false
- if true will print false negatives and false positives.
-model model
the model file to be evaluated.
+ -misclassified true|false
+ if true will print false negatives and false positives.
-data sampleData
data to be used, usually a file name.
-encoding charsetName
@@ -3633,15 +3747,15 @@ Usage: opennlp EntityLinker model < sentences
<title>Languagemodel</title>
-<section id='tools.cli.languagemodel.LanguageModel'>
+<section id='tools.cli.languagemodel.NGramLanguageModel'>
-<title>LanguageModel</title>
+<title>NGramLanguageModel</title>
-<para>Gives the probability of a sequence of tokens in a language model</para>
+<para>Gives the probability and most probable next token(s) of a sequence of tokens in a language model</para>
<screen>
<![CDATA[
-Usage: opennlp LanguageModel model
+Usage: opennlp NGramLanguageModel model
]]>
</screen>