Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/07/11 14:38:29 UTC
svn commit: r1145149 -
/incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
Author: joern
Date: Mon Jul 11 12:38:29 2011
New Revision: 1145149
URL: http://svn.apache.org/viewvc?rev=1145149&view=rev
Log:
OPENNLP-215 Added note to add contribution, and did a little restructuring to fit in the training api section in a consistent way.
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml?rev=1145149&r1=1145148&r2=1145149&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml Mon Jul 11 12:38:29 2011
@@ -79,7 +79,7 @@ A form of asbestos once used to make Ken
each
sentence are identified.
</para>
- </section>
+
<section id="tools.tokenizer.cmdline">
<title>Tokenizer Tools</title>
<para>The easiest way to try out the tokenizers are the command line
@@ -221,17 +221,22 @@ double tokenProbs[] = tokenizer.getToken
and 0 the lowest possible probability.
</para>
</section>
- <section id="tools.tokenizer.cmdline.training">
- <title>Training Tool</title>
- <para>
- OpenNLP has a command line tool which is used to train the models
- available from the model download page on various corpora. The data
- must be converted to the OpenNLP Tokenizer training format. Which is
- one sentence per line. Tokens are either separater by a whitespace or
- if by a special <SPLIT> tag.
+ </section>
+
+ <section id="tools.tokenizer.training">
+ <title>Tokenizer Training</title>
- The following sample shows the sample from above in the correct format.
- <programlisting>
+ <section id="tools.tokenizer.training.tool">
+ <title>Training Tool</title>
+ <para>
+ OpenNLP has a command line tool which is used to train the models
+ available from the model download page on various corpora. The data
+ must be converted to the OpenNLP Tokenizer training format, which is
+ one sentence per line. Tokens are either separated by whitespace or
+ by a special <SPLIT> tag.
+
+ The following sample shows the sample from above in the correct format.
+ <programlisting>
<![CDATA[
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>.
@@ -251,9 +256,9 @@ Usage: opennlp TokenizerTrainer-lang lan
-cutoff num specifies the min number of times a feature must be seen
-alphaNumOpt Optimization flag to skip alpha numeric tokens for further tokenization
]]>
- </screen>
- To train the english tokenizer use the following command:
- <screen>
+ </screen>
+ To train the English tokenizer, use the following command:
+ <screen>
<![CDATA[
$ bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt \
+-data en-token.train -model en-token.bin
@@ -288,9 +293,17 @@ Performing 100 iterations.
Wrote tokenizer model.
Path: en-token.bin
]]>
- </screen>
- </para>
+ </screen>
+ </para>
+ </section>
+ <section id="tools.tokenizer.training.api">
+ <title>Training API</title>
+ <para>TODO: Write documentation about the tokenizer training API. Any contributions
+are very welcome. If you want to contribute, please contact us on the mailing list
+or comment on the JIRA issue <ulink url="https://issues.apache.org/jira/browse/OPENNLP-215">OPENNLP-215</ulink>.</para>
+ </section>
</section>
+
<section id="tools.tokenizer.detokenizing">
<title>Detokenizing</title>
<para>
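[Editor's note] The Training API section added above is still a TODO. As a starting point for contributors, the snippet below sketches how the training tool's behavior could be reproduced in code. It is a sketch only: it assumes the 1.5-era `TokenizerME.train(languageCode, samples, useAlphaNumericOptimization)` overload, the `TokenSampleStream` wrapper, and the `PlainTextByLineStream(Reader)` constructor; the exact signatures should be checked against the OpenNLP release being documented.

```java
// Hedged sketch of the OpenNLP tokenizer training API (1.5-era signatures
// assumed); verify method signatures against the actual release.
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;

import opennlp.tools.tokenize.TokenSample;
import opennlp.tools.tokenize.TokenSampleStream;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.ObjectStream;
import opennlp.tools.util.PlainTextByLineStream;

public class TokenizerTrainingSketch {

    public static void main(String[] args) throws Exception {
        // Read the training data: one sentence per line, tokens separated
        // by whitespace or by a <SPLIT> tag, as described in the manual.
        ObjectStream<String> lineStream = new PlainTextByLineStream(
            new InputStreamReader(new FileInputStream("en-token.train"), "UTF-8"));
        ObjectStream<TokenSample> sampleStream = new TokenSampleStream(lineStream);

        // Train with the alpha-numeric optimization enabled, mirroring the
        // -alphaNumOpt flag of the command line TokenizerTrainer tool.
        TokenizerModel model = TokenizerME.train("en", sampleStream, true);

        // Serialize the model so it can later be loaded into a TokenizerME.
        OutputStream modelOut = new FileOutputStream("en-token.bin");
        try {
            model.serialize(modelOut);
        } finally {
            modelOut.close();
        }
    }
}
```

The resulting en-token.bin is the same kind of artifact the command line invocation shown earlier produces, and can be loaded with `new TokenizerModel(new FileInputStream("en-token.bin"))`.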