Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/07/11 14:38:29 UTC

svn commit: r1145149 - /incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml

Author: joern
Date: Mon Jul 11 12:38:29 2011
New Revision: 1145149

URL: http://svn.apache.org/viewvc?rev=1145149&view=rev
Log:
OPENNLP-215 Added note to add contribution, and did a little restructuring to fit in the training api section in a consistent way.

Modified:
    incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml

Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml?rev=1145149&r1=1145148&r2=1145149&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml Mon Jul 11 12:38:29 2011
@@ -79,7 +79,7 @@ A form of asbestos once used to make Ken
 			each
 			sentence are identified.
 		</para>
-	</section>
+	
 	<section id="tools.tokenizer.cmdline">
 		<title>Tokenizer Tools</title>
 		<para>The easiest way to try out the tokenizers are the command line
@@ -221,17 +221,22 @@ double tokenProbs[] = tokenizer.getToken
 			and 0 the lowest possible probability.
 		</para>
 	</section>
-	<section id="tools.tokenizer.cmdline.training">
-		<title>Training Tool</title>
-		<para>
-			OpenNLP has a command line tool which is used to train the models
-			available from the model download page on various corpora. The data
-			must be converted to the OpenNLP Tokenizer training format. Which is
-			one sentence per line. Tokens are either separater by a whitespace or
-			if by a special &lt;SPLIT&gt; tag.
+	</section>
+	
+	<section id="tools.tokenizer.training">
+		<title>Tokenizer Training</title>
 			
-			The following sample shows the sample from above in the correct format.
-						<programlisting>
+		<section id="tools.tokenizer.training.tool">
+			<title>Training Tool</title>
+			<para>
+				OpenNLP has a command line tool which is used to train the models
+				available from the model download page on various corpora. The data
+				must be converted into the OpenNLP Tokenizer training format, which
+				is one sentence per line. Tokens are either separated by whitespace
+				or by a special &lt;SPLIT&gt; tag.
+				
+				The following example shows the sample from above in the correct format.
+				<programlisting>
 			<![CDATA[
 Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.
 Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>.
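[Archive editor's note: the &lt;SPLIT&gt; format described in the added paragraph above can be sketched in plain Java. This parser is only an illustration of the format as the text describes it (whitespace or a `<SPLIT>` tag separates tokens); it is not OpenNLP's actual `TokenSample` parsing code.]

```java
import java.util.ArrayList;
import java.util.List;

public class SplitFormatDemo {

    // Split one training line into tokens: whitespace separates tokens,
    // and a <SPLIT> tag marks a token boundary without whitespace.
    static List<String> parseLine(String line) {
        List<String> tokens = new ArrayList<>();
        for (String chunk : line.trim().split("\\s+")) {
            // "<SPLIT>" contains no regex metacharacters, so it is safe
            // to pass to String.split as-is.
            for (String token : chunk.split("<SPLIT>")) {
                if (!token.isEmpty()) {
                    tokens.add(token);
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        String line = "Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, "
                + "the Dutch publishing group<SPLIT>.";
        // Prints the token list, e.g. the trailing "." becomes its own token.
        System.out.println(parseLine(line));
    }
}
```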
@@ -251,9 +256,9 @@ Usage: opennlp TokenizerTrainer-lang lan
 -cutoff num        specifies the min number of times a feature must be seen
 -alphaNumOpt Optimization flag to skip alpha numeric tokens for further tokenization
 			]]>		
-			</screen>
-			To train the english tokenizer use the following command:
-			<screen>
+				</screen>
+				To train the English tokenizer, use the following command:
+				<screen>
 			<![CDATA[
 $ bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt \ 
 +-data en-token.train -model en-token.bin
@@ -288,9 +293,17 @@ Performing 100 iterations.
 Wrote tokenizer model.
 Path: en-token.bin
 			]]>		
-			</screen>
-		</para>
+				</screen>
+			</para>
+		</section>
+		<section id="tools.tokenizer.training.api">
+			<title>Training API</title>
+			<para>TODO: Write documentation about the tokenizer training API. Any contributions
+are very welcome. If you want to contribute, please contact us on the mailing list
+or comment on the jira issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-215">OPENNLP-215</ulink>.</para>
+		</section>
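[Archive editor's note: until the Training API section above is filled in, here is a hedged sketch of what tokenizer training might look like in code. It is based on the 1.5-era `opennlp.tools.tokenize` classes; the exact `TokenizerME.train(...)` overload, constructors, and file names (`en-token.train`, `en-token.bin`) are assumptions mirroring the command line example, and may differ between releases.]

```java
// Sketch only: assumes OpenNLP 1.5-style classes and signatures.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TokenizerTrainingSketch {
    public static void main(String[] args) throws Exception {
        // Read the <SPLIT>-annotated training data, one sentence per line.
        ObjectStream<String> lineStream =
                new PlainTextByLineStream(new FileInputStream("en-token.train"), "UTF-8");
        ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

        // Train with alphanumeric optimization enabled, mirroring the
        // -alphaNumOpt flag of the command line tool.
        TokenizerModel model = TokenizerME.train("en", sampleStream, true);
        sampleStream.close();

        // Persist the model, mirroring the -model argument.
        try (OutputStream modelOut = new FileOutputStream("en-token.bin")) {
            model.serialize(modelOut);
        }
    }
}
```

The trained model can then be loaded with `new TokenizerModel(new FileInputStream("en-token.bin"))` and passed to a `TokenizerME`, as shown earlier in this chapter.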
 	</section>
+	
 	<section id="tools.tokenizer.detokenizing">
 		<title>Detokenizing</title>
 		<para>