Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/01/31 20:30:30 UTC
svn commit: r1065721 -
/incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
Author: joern
Date: Mon Jan 31 19:30:30 2011
New Revision: 1065721
URL: http://svn.apache.org/viewvc?rev=1065721&view=rev
Log:
OPENNLP-64 Added section about training api and evaluation.
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml?rev=1065721&r1=1065720&r2=1065721&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml Mon Jan 31 19:30:30 2011
@@ -137,7 +137,7 @@ Sequence topSequences[] = tagger.topKSeq
The POS Tagger can be trained on annotated training material. The training material
is a collection of tokenized sentences where each token has the assigned part-of-speech tag.
The native POS Tagger training material looks like this:
- <programlisting>
+ <programlisting>
<![CDATA[
About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.
That_DT sounds_VBZ good_JJ ._.]]>
@@ -147,9 +147,10 @@ That_DT sounds_VBZ good_JJ ._.]]>
define a document boundary. If a document boundary should be included in the
training material it is suggested to use an empty line.
</para>
- <para>The Part-of-Speech Tagger can eihter be trained with a command line tool,
+ <para>The Part-of-Speech Tagger can either be trained with a command line tool,
or via a training API.
</para>
+
<section id="tools.postagger.training.tool">
<title>Training Tool</title>
<para>
@@ -169,6 +170,133 @@ Usage: opennlp POSTaggerTrainer -lang la
-cutoff num specifies the min number of times a feature must be seen]]>
</screen>
</para>
+ <para>
+ The following command illustrates how an English part-of-speech model can be trained:
+ <screen>
+ <![CDATA[
+$bin/opennlp POSTaggerTrainer -encoding UTF-8 -lang en -model-type maxent -data en-pos.train -model en-pos-maxent.bin]]>
+ </screen>
+ </para>
+ </section>
+ <section id="tools.postagger.training.api">
+ <title>Training API</title>
+ <para>
+ The Part-of-Speech Tagger training API supports training a new POS model programmatically.
+ Three steps are necessary to train it:
+ <itemizedlist>
+ <listitem>
+ <para>The application must open a sample data stream</para>
+ </listitem>
+ <listitem>
+ <para>Call the POSTaggerME.train method</para>
+ </listitem>
+ <listitem>
+ <para>Save the POSModel to a file or database</para>
+ </listitem>
+ </itemizedlist>
+ The following code illustrates that:
+ <programlisting language="java">
+ <![CDATA[
+POSModel model = null;
+
+InputStream dataIn = null;
+try {
+  dataIn = new FileInputStream("en-pos.train");
+  ObjectStream<String> lineStream =
+      new PlainTextByLineStream(dataIn, "UTF-8");
+  ObjectStream<POSSample> sampleStream = new WordTagSampleStream(lineStream);
+
+  model = POSTaggerME.train("en", sampleStream, ModelType.MAXENT,
+      null, null, 100, 5);
+}
+catch (IOException e) {
+  // Failed to read or parse training data, training failed
+  e.printStackTrace();
+}
+finally {
+  if (dataIn != null) {
+    try {
+      dataIn.close();
+    }
+    catch (IOException e) {
+      // Not an issue, training already finished.
+      // The exception should be logged and investigated
+      // if part of a production system.
+      e.printStackTrace();
+    }
+  }
+}]]>
+ </programlisting>
+ The above code performs the first two steps: opening the data and training
+ the model. The trained model must still be saved to an OutputStream; in
+ the sample below it is written to a file.
+ <programlisting language="java">
+ <![CDATA[
+OutputStream modelOut = null;
+try {
+  // modelFile: the target file (a File or path String), defined elsewhere
+  modelOut = new BufferedOutputStream(new FileOutputStream(modelFile));
+  model.serialize(modelOut);
+}
+catch (IOException e) {
+  // Failed to save model
+  e.printStackTrace();
+}
+finally {
+  if (modelOut != null) {
+    try {
+      modelOut.close();
+    }
+    catch (IOException e) {
+      // Failed to correctly save model.
+      // Written model might be invalid.
+      e.printStackTrace();
+    }
+  }
+}]]>
+ </programlisting>
+ </para>
+ </section>
+ <section id="tools.postagger.training.tagdict">
+ <title>Tag Dictionary</title>
+ <para>
+ The tag dictionary is a word dictionary which specifies which tags a specific token can have. Using a tag
+ dictionary has two advantages: inappropriate tags cannot be assigned to tokens in the dictionary, and the
+ beam search algorithm has to consider fewer possibilities and can search faster.
+ </para>
+ <para>
+ The dictionary is defined in an XML format and can be created and stored with the POSDictionary class.
+ For now, please check the Javadoc and source code of that class.
+ </para>
+ <para>Note: Contributions to extend this section are welcome. The format should be documented and
+ sample code should show how to use the dictionary.</para>
+ </section>
+ </section>
+
+ <section id="tools.postagger.eval">
+ <title>Evaluation</title>
+ <para>
+ The built-in evaluation can measure the accuracy of the POS tagger.
+ The accuracy can be measured on a test data set or via cross-validation.
+ </para>
+ <section id="tools.postagger.eval.tool">
+ <title>Evaluation Tool</title>
+ <para>
+ There is a command line tool to evaluate a given model on a test data set.
+ The command line tool does not currently support cross-validation
+ evaluation (contributions welcome).
+ The following command shows how the tool can be run:
+ <screen>
+ <![CDATA[
+$bin/opennlp POSTaggerEvaluator -encoding utf-8 -model pt.postagger.model -data pt.postagger.test]]>
+ </screen>
+ This will display the resulting accuracy score, e.g.:
+ <screen>
+ <![CDATA[
+Loading model ... done
+Evaluating ... done
+
+Accuracy: 0.9659110277825124]]>
+ </screen>
+ </para>
</section>
</section>
</chapter>
\ No newline at end of file
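
Note on the training format and the accuracy score documented in the diff above: the sketch below is not OpenNLP code and is not part of this commit. It only illustrates, in plain Java, how lines in the token_TAG format could be split into tokens and tags, and how a per-token accuracy could be computed by comparing predicted tags with gold tags. The class name WordTagAccuracy and its methods are hypothetical, and the sketch assumes that accuracy means the fraction of tokens whose predicted tag equals the gold tag.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration only; not an OpenNLP class. */
final class WordTagAccuracy {

    /** Splits one line such as "That_DT sounds_VBZ" into {token, tag} pairs. */
    static List<String[]> parseLine(String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String item : line.trim().split("\\s+")) {
            if (item.isEmpty()) {
                continue; // blank line: no tokens
            }
            // Assumption: the tag follows the LAST underscore of the item.
            int split = item.lastIndexOf('_');
            if (split <= 0 || split == item.length() - 1) {
                throw new IllegalArgumentException(
                        "Expected token_TAG but got: " + item);
            }
            pairs.add(new String[] {
                item.substring(0, split), item.substring(split + 1)
            });
        }
        return pairs;
    }

    /** Fraction of tokens whose predicted tag equals the gold tag. */
    static double accuracy(List<String> goldLines, List<String> predictedLines) {
        if (goldLines.size() != predictedLines.size()) {
            throw new IllegalArgumentException("Sentence counts differ");
        }
        int correct = 0;
        int total = 0;
        for (int i = 0; i < goldLines.size(); i++) {
            List<String[]> gold = parseLine(goldLines.get(i));
            List<String[]> predicted = parseLine(predictedLines.get(i));
            if (gold.size() != predicted.size()) {
                throw new IllegalArgumentException(
                        "Token counts differ in sentence " + i);
            }
            for (int j = 0; j < gold.size(); j++) {
                total++;
                if (gold.get(j)[1].equals(predicted.get(j)[1])) {
                    correct++;
                }
            }
        }
        if (total == 0) {
            throw new IllegalArgumentException("No tokens to evaluate");
        }
        return (double) correct / total;
    }

    public static void main(String[] args) {
        // Gold sentences taken from the training sample in the documentation.
        List<String> gold = Arrays.asList(
                "About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VBP ._.",
                "That_DT sounds_VBZ good_JJ ._.");
        // Made-up tagger output with one wrong tag (reckon_VB).
        List<String> predicted = Arrays.asList(
                "About_IN 10_CD Euro_NNP ,_, I_PRP reckon_VB ._.",
                "That_DT sounds_VBZ good_JJ ._.");
        System.out.println("Accuracy: " + accuracy(gold, predicted));
    }
}
```

Splitting on the last underscore keeps tokens that themselves contain an underscore intact. Whether OpenNLP's own WordTagSampleStream applies the same rule is not confirmed here.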