Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/05/31 12:51:40 UTC
svn commit: r1129615 - /incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
Author: joern
Date: Tue May 31 10:51:39 2011
New Revision: 1129615
URL: http://svn.apache.org/viewvc?rev=1129615&view=rev
Log:
OPENNLP-194 Fixed too long lines
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml?rev=1129615&r1=1129614&r2=1129615&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/tokenizer.xml Tue May 31 10:51:39 2011
@@ -28,7 +28,8 @@
<![CDATA[
Pierre Vinken, 61 years old, will join the board as a nonexecutive director Nov. 29.
Mr. Vinken is chairman of Elsevier N.V., the Dutch publishing group.
-Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields PLC, was named a director of this British industrial conglomerate.
+Rudolph Agnew, 55 years old and former chairman of Consolidated Gold Fields
+ PLC, was named a director of this British industrial conglomerate.
]]>
</programlisting>
@@ -39,8 +40,11 @@ Rudolph Agnew, 55 years old and former c
<![CDATA[
Pierre Vinken , 61 years old , will join the board as a nonexecutive director Nov. 29 .
Mr. Vinken is chairman of Elsevier N.V. , the Dutch publishing group .
-Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC , was named a nonexecutive director of this British industrial conglomerate .
-A form of asbestos once used to make Kent cigarette filters has caused a high percentage of cancer deaths among a group of workers exposed to it more than 30 years ago , researchers reported .
+Rudolph Agnew , 55 years old and former chairman of Consolidated Gold Fields PLC ,
+ was named a nonexecutive director of this British industrial conglomerate .
+A form of asbestos once used to make Kent cigarette filters has caused a high
+ percentage of cancer deaths among a group of workers exposed to it more than 30 years ago ,
+ researchers reported .
]]>
</programlisting>
@@ -127,7 +131,8 @@ Showa Shell gained 20 to 1,570 and Mitsu
Sumitomo Metal Mining fell five yen to 692 and Nippon Mining added 15 to 960 .
Among other winners Wednesday was Nippon Shokubai , which was up 80 at 2,410 .
Marubeni advanced 11 to 890 .
-London share prices were bolstered largely by continued gains on Wall Street and technical factors affecting demand for London 's blue-chip stocks .
+London share prices were bolstered largely by continued gains on Wall Street and technical
+ factors affecting demand for London 's blue-chip stocks .
...etc...]]>
</screen>
Of course this is all on the command line. Many people use the models
@@ -230,14 +235,16 @@ double tokenProbs[] = tokenizer.getToken
<![CDATA[
Pierre Vinken<SPLIT>, 61 years old<SPLIT>, will join the board as a nonexecutive director Nov. 29<SPLIT>.
Mr. Vinken is chairman of Elsevier N.V.<SPLIT>, the Dutch publishing group<SPLIT>.
-Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>, was named a nonexecutive director of this British industrial conglomerate<SPLIT>.
+Rudolph Agnew<SPLIT>, 55 years old and former chairman of Consolidated Gold Fields PLC<SPLIT>,
+ was named a nonexecutive director of this British industrial conglomerate<SPLIT>.
]]>
</programlisting>
Usage of the tool:
<screen>
<![CDATA[
$ bin/opennlp TokenizerTrainer
-Usage: opennlp TokenizerTrainer-lang language -encoding charset [-iterations num] [-cutoff num] [-alphaNumOpt] -data trainingData -model model
+Usage: opennlp TokenizerTrainer-lang language -encoding charset [-iterations num] \
+[-cutoff num] [-alphaNumOpt] -data trainingData -model model
-lang language specifies the language which is being processed.
-encoding charset specifies the encoding which should be used for reading and writing text.
-iterations num specified the number of training iterations
@@ -248,7 +255,8 @@ Usage: opennlp TokenizerTrainer-lang lan
To train the english tokenizer use the following command:
<screen>
<![CDATA[
-$ bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt -data en-token.train -model en-token.bin
+$ bin/opennlp TokenizerTrainer -encoding UTF-8 -lang en -alphaNumOpt \
++-data en-token.train -model en-token.bin
Indexing events using cutoff of 5
Computing event counts... done. 262271 events