Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/01/28 17:08:42 UTC

svn commit: r1064754 - /incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml

Author: joern
Date: Fri Jan 28 16:08:41 2011
New Revision: 1064754

URL: http://svn.apache.org/viewvc?rev=1064754&view=rev
Log:
OPENNLP-64 Added a part of training documentation

Modified:
    incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml

Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml?rev=1064754&r1=1064753&r2=1064754&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml Fri Jan 28 16:08:41 2011
@@ -131,4 +131,44 @@ Sequence topSequences[] = tagger.topKSeq
 	  		 </para>
 	</section>
 	</section>
+		<section id="tools.postagger.training">
+		<title>Training</title>
+		<para>
+			The POS Tagger can be trained on annotated training material. The training material
+			is a collection of tokenized sentences in which each token carries its assigned part-of-speech tag.
+			The native POS Tagger training material looks like this:
+						<programlisting>
+		  <![CDATA[
+About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
+That_DT sounds_VBZ good_JJ ._.]]>
+			</programlisting>		
+			Each sentence must be on one line. Each token is joined to its tag with "_",
+			and the token/tag pairs are separated by whitespace. The data format does not
+			define a document boundary. To mark a document boundary in the
+			training material, an empty line is suggested.
+		</para>
+		<para>The Part-of-Speech Tagger can either be trained with a command line tool,
+		or via a training API.
+		</para>
+		<section id="tools.postagger.training.tool">
+		<title>Training Tool</title>
+		<para>
+			OpenNLP has a command line tool for training. It is the same tool used to train,
+			on various corpora, the models available from the model download page.
+		</para>
+		<para>
+		 Usage of the tool:
+			<screen>
+				<![CDATA[
+$ bin/opennlp POSTaggerTrainer
+Usage: opennlp POSTaggerTrainer -lang language -encoding charset [-iterations num] [-cutoff num] \ 
+    [-dict tagdict] [-model maxent|perceptron|perceptron_sequence] -data trainingData -model model
+-lang language     specifies the language which is being processed.
+-encoding charset  specifies the encoding which should be used for reading and writing text.
+-iterations num    specified the number of training iterations
+-cutoff num        specifies the min number of times a feature must be seen]]>
+			 </screen>
+		</para>
+		</section>
+		</section>
 </chapter>
\ No newline at end of file
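
The training data format described in the new "Training" section (one sentence per line, whitespace-separated token_tag pairs, an empty line as a suggested document boundary) can be illustrated with a small reader. This is only a sketch and not code from OpenNLP: the class name PosLineParser and its parse method are hypothetical, and splitting each pair on its last "_" is an assumption made here so that tokens which themselves contain "_" are still handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of OpenNLP: reads one line of the
// token_tag training format described in the documentation above.
final class PosLineParser {

    /** Splits one training line into {token, tag} pairs. */
    static List<String[]> parse(String line) {
        List<String[]> pairs = new ArrayList<>();
        String trimmed = line.trim();
        if (trimmed.isEmpty()) {
            // An empty line is the suggested document boundary; it has no pairs.
            return pairs;
        }
        // Token/tag pairs are whitespace separated.
        for (String pair : trimmed.split("\\s+")) {
            // Assumption of this sketch: the last "_" separates token and tag.
            int sep = pair.lastIndexOf('_');
            if (sep <= 0 || sep == pair.length() - 1) {
                throw new IllegalArgumentException("Not a token_tag pair: " + pair);
            }
            pairs.add(new String[] {pair.substring(0, sep), pair.substring(sep + 1)});
        }
        return pairs;
    }

    public static void main(String[] args) {
        // Prints each token and its tag, separated by a tab.
        for (String[] p : parse("That_DT sounds_VBZ good_JJ ._.")) {
            System.out.println(p[0] + "\t" + p[1]);
        }
    }
}
```

Rejecting pairs that lack a token or a tag makes malformed lines visible early, before they reach a trainer.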