Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/01/28 17:08:42 UTC
svn commit: r1064754 -
/incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
Author: joern
Date: Fri Jan 28 16:08:41 2011
New Revision: 1064754
URL: http://svn.apache.org/viewvc?rev=1064754&view=rev
Log:
OPENNLP-64 Added a part of training documentation
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml?rev=1064754&r1=1064753&r2=1064754&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml Fri Jan 28 16:08:41 2011
@@ -131,4 +131,44 @@ Sequence topSequences[] = tagger.topKSeq
</para>
</section>
</section>
+ <section id="tools.postagger.training">
+ <title>Training</title>
+ <para>
+ The POS Tagger can be trained on annotated training material. The training material
+ is a collection of tokenized sentences where each token has the assigned part-of-speech tag.
+ The native POS Tagger training material looks like this:
+ <programlisting>
+ <![CDATA[
+About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
+That_DT sounds_VBZ good_JJ ._.]]>
+ </programlisting>
+ Each sentence must be on one line. Each token is joined to its tag with "_",
+ and the token/tag pairs are separated by whitespace. The data format does not
+ define a document boundary. If a document boundary should be included in the
+ training material, an empty line is suggested as the separator.
+ </para>
+ <para>The Part-of-Speech Tagger can either be trained with a command line tool,
+ or via a training API.
+ </para>
+ <section id="tools.postagger.training.tool">
+ <title>Training Tool</title>
+ <para>
+ OpenNLP has a command line tool which is used to train the models available from the model
+ download page on various corpora.
+ </para>
+ <para>
+ Usage of the tool:
+ <screen>
+ <![CDATA[
+$ bin/opennlp POSTaggerTrainer
+Usage: opennlp POSTaggerTrainer -lang language -encoding charset [-iterations num] [-cutoff num] \
+ [-dict tagdict] [-model maxent|perceptron|perceptron_sequence] -data trainingData -model model
+-lang language specifies the language which is being processed.
+-encoding charset specifies the encoding which should be used for reading and writing text.
+-iterations num specified the number of training iterations
+-cutoff num specifies the min number of times a feature must be seen]]>
+ </screen>
+ </para>
+ </section>
+ </section>
</chapter>
\ No newline at end of file
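
Illustration of the training format documented in the diff above: a minimal Java sketch that splits one training line into token/tag pairs. It uses only the JDK and is not the OpenNLP reader. The names PosLineParser, TaggedToken and parseLine are hypothetical, and splitting each pair at its last "_" is an assumption made here so that tokens containing "_" still parse; the documentation only says the pairs are combined with "_".

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: parses one line of the
// word_TAG training format described in the documentation above.
public class PosLineParser {

    /** One token together with its part-of-speech tag. */
    public static final class TaggedToken {
        public final String token;
        public final String tag;

        public TaggedToken(String token, String tag) {
            this.token = token;
            this.tag = tag;
        }
    }

    /**
     * Splits a single sentence line into token/tag pairs.
     * An empty line (the suggested document boundary) yields an empty list.
     */
    public static List<TaggedToken> parseLine(String line) {
        List<TaggedToken> result = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            return result;
        }
        // Pairs are whitespace separated.
        for (String pair : trimmed.split("\\s+")) {
            // Assumption: split at the LAST underscore, so ",_," and
            // tokens that themselves contain "_" are handled.
            int split = pair.lastIndexOf('_');
            if (split <= 0 || split == pair.length() - 1) {
                throw new IllegalArgumentException(
                        "Cannot find token/tag separator in: " + pair);
            }
            result.add(new TaggedToken(
                    pair.substring(0, split), pair.substring(split + 1)));
        }
        return result;
    }

    public static void main(String[] args) {
        // Sample sentence taken from the documentation above.
        for (TaggedToken t : parseLine("About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.")) {
            System.out.println(t.token + " -> " + t.tag);
        }
    }
}
```

A reader built this way could treat an empty list as a document boundary, matching the empty-line suggestion in the documentation.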