Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/01/28 14:14:37 UTC
svn commit: r1064658 -
/incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
Author: joern
Date: Fri Jan 28 13:14:36 2011
New Revision: 1064658
URL: http://svn.apache.org/viewvc?rev=1064658&view=rev
Log:
OPENNLP-64 Added a section about the tagger API.
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml?rev=1064658&r1=1064657&r2=1064658&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/postagger.xml Fri Jan 28 13:14:36 2011
@@ -27,9 +27,9 @@ under the License.
<title>Tagging</title>
<para>
The Part of Speech Tagger marks tokens with their corresponding word type
- based on the token itself and the context of the token. A token can have
+ based on the token itself and the context of the token. A token might have
multiple pos tags depending on the token and the context. The OpenNLP POS Tagger
- uses a probability model to guess the correct pos tag out of the tag set.
+ uses a probability model to predict the correct pos tag out of the tag set.
To limit the possible tags for a token a tag dictionary can be used which increases
the tagging and runtime performance of the tagger.
</para>
@@ -57,6 +57,78 @@ Mr._NNP Vinken_NNP is_VBZ chairman_NN of
</programlisting>
The tag set used by the english pos model is the Penn Treebank tag set. See the link below for a description of the tags.
</para>
- </section>
+ </section>
+
+ <section id="tools.postagger.tagging.api">
+ <title>POS Tagger API</title>
+ <para>
+ The POS Tagger can be embedded into an application via its API.
+ First the pos model must be loaded into memory from disk or another source.
+ In the sample below it is loaded from disk.
+ <programlisting language="java">
+ <![CDATA[
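+InputStream modelIn = null;
+POSModel model = null;
+
+try {
+ modelIn = new FileInputStream("en-pos-maxent.bin");
+ // Assign to the variable declared above so the model stays in scope after the try block
+ model = new POSModel(modelIn);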
+}
+catch (IOException e) {
+ // Model loading failed, handle the error
+ e.printStackTrace();
+}
+finally {
+ if (modelIn != null) {
+ try {
+ modelIn.close();
+ }
+ catch (IOException e) {
+ // Closing the stream failed; nothing more can be done here
+ }
+ }
+}]]>
+ </programlisting>
+ After the model is loaded the POSTaggerME can be instantiated.
+ <programlisting language="java">
+ <![CDATA[
+POSTaggerME tagger = new POSTaggerME(model);]]>
+ </programlisting>
+ The POS Tagger instance is now ready to tag data. It expects a tokenized sentence
+ as input, represented as a String array in which each String object
+ is one token.
+ </para>
+ <para>
+ The following code shows how to determine the most likely pos tag sequence for a sentence.
+ <programlisting language="java">
+ <![CDATA[
+String sent[] = new String[]{"Most", "large", "cities", "in", "the", "US", "had",
+ "morning", "and", "afternoon", "newspapers", "."};
+String tags[] = tagger.tag(sent);]]>
+ </programlisting>
+ The tags array contains one part-of-speech tag for each token in the input array. The tag
+ for a token is found at the same index as that token in the input array.
+ The confidence scores for the returned tags can be retrieved from
+ a POSTaggerME with the following method call:
+ <programlisting language="java">
+ <![CDATA[
+double probs[] = tagger.probs();]]>
+ </programlisting>
+ The call to probs is stateful and always returns the probabilities of the last
+ tagged sentence. The probs method should only be called after the tag method
+ has been called; otherwise the behavior is undefined.
+ </para>
+ <para>
+ Some applications need to retrieve the n-best pos tag sequences and not
+ only the best sequence.
+ The topKSequences method returns the top sequences.
+ It is called in the same way as tag.
+ <programlisting language="java">
+ <![CDATA[
+Sequence topSequences[] = tagger.topKSequences(sent);]]>
+ </programlisting>
+ Each Sequence object contains one sequence. The tags of the sequence can be retrieved
+ via Sequence.getOutcomes() and the corresponding probabilities
+ via Sequence.getProbs().
+ </para>
+ </section>
</section>
</chapter>
\ No newline at end of file
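
The new API section states that each tag sits at the same index as its token. A minimal sketch of that pairing follows, printing the token_TAG format used earlier in postagger.xml. It does not call OpenNLP: the class name TagAlignment and the method join are hypothetical, and the tags array is hand-written for illustration, standing in for the result of tagger.tag(sent) rather than showing real tagger output.

```java
final class TagAlignment {

    // Joins each token with the tag at the same index, e.g. "cities_NNS".
    static String join(String[] tokens, String[] tags) {
        if (tokens.length != tags.length) {
            throw new IllegalArgumentException("tokens and tags must have the same length");
        }
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            out.append(tokens[i]).append('_').append(tags[i]);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] sent = {"Most", "large", "cities", "in", "the", "US", "had",
            "morning", "and", "afternoon", "newspapers", "."};
        // Assumption: hand-written Penn Treebank tags, not produced by a model.
        String[] tags = {"JJS", "JJ", "NNS", "IN", "DT", "NNP", "VBD",
            "NN", "CC", "NN", "NNS", "."};
        System.out.println(join(sent, tags));
    }
}
```

In an application the tags array would come from the tagger instead, and the same loop index could be used to read the matching entry of the array returned by probs.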