Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/01/25 12:51:28 UTC

svn commit: r1063240 - /incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml

Author: joern
Date: Tue Jan 25 11:51:28 2011
New Revision: 1063240

URL: http://svn.apache.org/viewvc?rev=1063240&view=rev
Log:
OPENNLP-79 Added Leipzig Corpora documentation

Modified:
    incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml

Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml?rev=1063240&r1=1063239&r2=1063240&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml Tue Jan 25 11:51:28 2011
@@ -305,4 +305,95 @@ F-Measure: 0.7717879983140168]]>
 			</para>
 		</section>
 	</section>
+	<section id="tools.corpora.leipzig">
+	<title>Leipzig Corpora</title>
+	<para>
+	The Leipzig Corpora collection provides corpora in many different languages. Each corpus is a collection of individual sentences
+	collected from the web and from newspapers. The corpora are available as plain text and as MySQL database tables. The OpenNLP
+	integration can only use the plain text version.
+	</para>
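+	<para>
+	Each plain text package contains a sentences.txt file in which every line is expected to start with a sentence
+	number, followed by the tokens of the sentence. The two lines below are made up and only illustrate the
+	assumed layout:
+	<programlisting>
+			<![CDATA[
+1	This is the first sample sentence .
+2	This is another sample sentence .]]>
+	</programlisting>
+	</para>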
+	<para>
+	The corpora in the different languages can be used to train a document categorizer model which can detect the document language. 
+	The individual plain text packages can be downloaded here:
+	<ulink url="http://corpora.uni-leipzig.de/download.html">http://corpora.uni-leipzig.de/download.html</ulink>
+	</para>
+	
+	<para>
+	After all packages have been downloaded, unzip them and use the following commands to
+	produce a training file which can be processed by the Document Categorizer:
+	<programlisting>
+			<![CDATA[
+bin/opennlp DoccatConverter leipzig -lang cat -data Leipzig/cat100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang de -data Leipzig/de100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang dk -data Leipzig/dk100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang ee -data Leipzig/ee100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang en -data Leipzig/en100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang fi -data Leipzig/fi100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang fr -data Leipzig/fr100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang it -data Leipzig/it100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang jp -data Leipzig/jp100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang kr -data Leipzig/kr100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang nl -data Leipzig/nl100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang no -data Leipzig/no100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang se -data Leipzig/se100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang sorb -data Leipzig/sorb100k/sentences.txt >> lang.train
+bin/opennlp DoccatConverter leipzig -lang tr -data Leipzig/tr100k/sentences.txt >> lang.train]]>
+	</programlisting>
+	</para>
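+	<para>
+	The produced lang.train file is in the document categorizer training format: one document per line, starting
+	with the category (here the language code), followed by the document text. The two lines below are made up
+	and only illustrate the layout:
+	<programlisting>
+			<![CDATA[
+en This is an English sample sentence .
+de Dies ist ein deutscher Beispielsatz .]]>
+	</programlisting>
+	</para>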
+	<para>
+	Depending on your platform's default locale it might be problematic to output characters which are not supported by that encoding.
+	We therefore suggest running these commands on a platform which has a Unicode default encoding, e.g. Linux with UTF-8.
+	</para>
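+	<para>
+	On Linux or Mac OS X the default encoding of the current shell session can usually be switched to UTF-8
+	before running the commands above. The locale name below is an assumption and must be installed on your system:
+	<programlisting>
+			<![CDATA[
+# assumes the en_US.UTF-8 locale is installed; check the available locales with "locale -a"
+$ export LC_ALL=en_US.UTF-8
+$ locale]]>
+	</programlisting>
+	</para>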
+	<para>
+	After the lang.train file is created, the actual language detection document categorizer model
+	can be created with the following command:
+	<programlisting>
+			<![CDATA[
+bin/opennlp DoccatTrainer -lang x-unspecified -encoding MacRoman -data ../lang.train -model lang.model
+Indexing events using cutoff of 5
+
+	Computing event counts...  done. 10000 events
+	Indexing...  done.
+Sorting and merging events... done. Reduced 10000 events to 10000.
+Done indexing.
+Incorporating indexed data for training...  
+done.
+	Number of Event Tokens: 10000
+	    Number of Outcomes: 2
+	  Number of Predicates: 42730
+...done.
+Computing model parameters...
+Performing 100 iterations.
+  1:  .. loglikelihood=-6931.471805600547	0.5
+  2:  .. loglikelihood=-2110.9654348555955	1.0
+... cut lots of iterations ...
+
+ 99:  .. loglikelihood=-0.449640418555347	1.0
+100:  .. loglikelihood=-0.443746359746235	1.0
+Writing document categorizer model ... done (1.210s)
+
+Wrote document categorizer model to
+path: /Users/joern/dev/opennlp-apache/opennlp/opennlp-tools/lang.model
+]]>
+	</programlisting>
+	In the sample above the language detection model was trained to distinguish two languages, Danish and English.
+	</para>
+	
+	<para>
+	After the model is created it can be used to detect the two languages:
+	
+	<programlisting>
+			<![CDATA[
+$ bin/opennlp Doccat ../lang.model
+Loading Document Categorizer model ... done (0.289s)
+The American Finance Association is pleased to announce the award of ...
+en	The American Finance Association is pleased to announce the award of ...
+Danskerne skal betale for den økonomiske krise ved at blive længere på arbejdsmarkedet .
+dk	Danskerne skal betale for den økonomiske krise ved at blive længere på arbejdsmarkedet .]]>	
+	</programlisting>
+	</para>
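+	<para>
+	The Doccat tool reads the documents to categorize line by line from standard input, so a text file with
+	one document per line can also be redirected to it instead of typing sentences interactively. The input
+	file name below is only a placeholder:
+	<programlisting>
+			<![CDATA[
+$ bin/opennlp Doccat ../lang.model < my_sentences.txt]]>
+	</programlisting>
+	</para>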
+	</section>
 </chapter>
\ No newline at end of file