Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/01/31 21:14:41 UTC
svn commit: r1065739 -
/incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
Author: joern
Date: Mon Jan 31 20:14:40 2011
New Revision: 1065739
URL: http://svn.apache.org/viewvc?rev=1065739&view=rev
Log:
OPENNLP-111 Added section about detokenizer.
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml?rev=1065739&r1=1065738&r2=1065739&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml Mon Jan 31 20:14:40 2011
@@ -283,4 +283,65 @@ Path: en-token.bin
</screen>
</para>
</section>
+ <section id="tools.tokenizer.detokenizing">
+ <title>Detokenizing</title>
+ <para>
+      Detokenizing is simply the opposite of tokenization: the original non-tokenized string is
+      reconstructed from a token sequence. The OpenNLP implementation was created to undo the tokenization
+      of training data for the tokenizer. It can also be used to undo the tokenization produced by such
+      a trained tokenizer. The implementation is strictly rule based and defines how tokens should be attached
+      to a sentence-wise character sequence.
+ </para>
+ <para>
+      The rule dictionary assigns to every token an operation which describes how it should be attached
+      to one continuous character sequence.
+ </para>
+ <para>
+ The following rules can be assigned to a token:
+ <itemizedlist>
+ <listitem>
+ <para>MERGE_TO_LEFT - Merges the token to the left side.</para>
+ </listitem>
+ <listitem>
+          <para>MERGE_TO_RIGHT - Merges the token to the right side.</para>
+ </listitem>
+ <listitem>
+          <para>RIGHT_LEFT_MATCHING - Merges the token to the right side on first occurrence
+          and to the left side on second occurrence.</para>
+ </listitem>
+ </itemizedlist>
+
+      The following sample illustrates how the detokenizer works with a small
+      rule dictionary (illustration format, not the xml data format):
+ <programlisting>
+ <![CDATA[
+. MERGE_TO_LEFT
+" RIGHT_LEFT_MATCHING]]>
+ </programlisting>
+      The dictionary should be used to detokenize the following whitespace-tokenized sentence:
+ <programlisting>
+ <![CDATA[
+He said " This is a test " .]]>
+ </programlisting>
+      Based on the dictionary, the tokens are assigned these operations:
+ <programlisting>
+ <![CDATA[
+He -> NO_OPERATION
+said -> NO_OPERATION
+" -> MERGE_TO_RIGHT
+This -> NO_OPERATION
+is -> NO_OPERATION
+a -> NO_OPERATION
+test -> NO_OPERATION
+" -> MERGE_TO_LEFT
+. -> MERGE_TO_LEFT]]>
+ </programlisting>
+ That will result in the following character sequence:
+ <programlisting>
+ <![CDATA[
+He said "This is a test".]]>
+ </programlisting>
+ TODO: Add documentation about the dictionary format and how to use the API. Contributions are welcome.
+ </para>
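The merging behavior walked through above can be sketched in plain Java. This is a minimal illustration of the rule logic only, assuming the operations described in this section; the class name DetokenizerSketch, its constructor, and its detokenize method are hypothetical and are not the OpenNLP API:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the rule-based detokenization described above.
// NOT the OpenNLP API: all names here are hypothetical.
public class DetokenizerSketch {

    enum Operation { MERGE_TO_LEFT, MERGE_TO_RIGHT, RIGHT_LEFT_MATCHING }

    private final Map<String, Operation> rules;

    // Tracks, per RIGHT_LEFT_MATCHING token, whether the next occurrence
    // is a first (merge right) or second (merge left) occurrence.
    private final Map<String, Boolean> matchingOpen = new HashMap<>();

    DetokenizerSketch(Map<String, Operation> rules) {
        this.rules = rules;
    }

    String detokenize(String[] tokens) {
        StringBuilder sb = new StringBuilder();
        boolean mergeNextToLeft = false; // set after a merge-right token
        for (String token : tokens) {
            Operation op = rules.get(token); // null means NO_OPERATION
            boolean mergeLeft = mergeNextToLeft;
            boolean mergeRight = false;
            if (op == Operation.MERGE_TO_LEFT) {
                mergeLeft = true;
            } else if (op == Operation.MERGE_TO_RIGHT) {
                mergeRight = true;
            } else if (op == Operation.RIGHT_LEFT_MATCHING) {
                boolean open = !matchingOpen.getOrDefault(token, false);
                matchingOpen.put(token, open);
                if (open) {
                    mergeRight = true;  // first occurrence: attach to the right
                } else {
                    mergeLeft = true;   // second occurrence: attach to the left
                }
            }
            if (sb.length() > 0 && !mergeLeft) {
                sb.append(' ');
            }
            sb.append(token);
            mergeNextToLeft = mergeRight;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, Operation> rules = new HashMap<>();
        rules.put(".", Operation.MERGE_TO_LEFT);
        rules.put("\"", Operation.RIGHT_LEFT_MATCHING);
        DetokenizerSketch d = new DetokenizerSketch(rules);
        String[] tokens = "He said \" This is a test \" .".split(" ");
        System.out.println(d.detokenize(tokens));
        // prints: He said "This is a test".
    }
}
```

Running the sketch on the whitespace-tokenized sentence from the walkthrough reproduces the character sequence shown above, with the first quote merged right, the second merged left, and the period merged left.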
+ </section>
</chapter>
\ No newline at end of file