Posted to commits@opennlp.apache.org by co...@apache.org on 2011/01/19 03:45:04 UTC
svn commit: r1060658 -
/incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml
Author: colen
Date: Wed Jan 19 02:45:03 2011
New Revision: 1060658
URL: http://svn.apache.org/viewvc?rev=1060658&view=rev
Log:
OPENNLP-65 added CONLL 2000 section
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml?rev=1060658&r1=1060657&r2=1060658&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/corpora.xml Wed Jan 19 02:45:03 2011
@@ -37,6 +37,105 @@ under the License.
environment. More information about the entire conference series can be obtained here
for CoNLL.
</para>
+ <section id="tools.corpora.conll.2000">
+ <title>CONLL 2000</title>
+ <para>
+ The shared task of CoNLL-2000 is chunking: dividing text into syntactically related, non-overlapping groups of words.
+ </para>
+ <section id="tools.corpora.conll.2000.getting">
+ <title>Getting the data</title>
+ <para>
+ CoNLL-2000 made training and test data available for the chunking task in English.
+ The data consists of the same partitions of the Wall Street Journal corpus (WSJ)
+ as the widely used data for noun phrase chunking: sections 15-18 as training data
+ (211727 tokens) and section 20 as test data (47377 tokens). The annotation of the
+ data has been derived from the WSJ corpus by a program written by Sabine Buchholz
+ from Tilburg University, The Netherlands. Both training and test data can be
+ obtained from <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">http://www.cnts.ua.ac.be/conll2000/chunking</ulink>.
+ </para>
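Each line of the data holds three columns: the token, its part-of-speech tag, and its chunk tag (B-X begins a chunk of type X, I-X continues it, O marks tokens outside any chunk); blank lines separate sentences. An illustrative sentence in this column format:

```
He PRP B-NP
reckons VBZ B-VP
the DT B-NP
current JJ I-NP
account NN I-NP
deficit NN I-NP
will MD B-VP
narrow VB I-VP
to TO B-PP
only RB B-NP
# # I-NP
1.8 CD I-NP
billion CD I-NP
in IN B-PP
September NNP B-NP
. . O
```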
+ </section>
+ <section id="tools.corpora.conll.2000.converting">
+ <title>Converting the data</title>
+ <para>
+ The data does not need to be transformed, because the Apache OpenNLP Chunker
+ follows the CoNLL-2000 format for training. See the <link linkend="tools.chunker.training">Chunker Training</link> section to learn more.
+ </para>
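No converter is needed, but it is easy to sanity-check a downloaded file before training. The sketch below is a hypothetical stdlib-only helper (not part of OpenNLP) that validates the three-column layout and counts chunks by their B- tags:

```java
import java.util.List;

// Hypothetical sanity check for CoNLL-2000 chunking data: each non-blank
// line must hold "token POS chunk-tag"; blank lines separate sentences.
public class Conll2000Check {

    // Returns the number of chunks, i.e. the number of B- tags seen.
    static int countChunks(List<String> lines) {
        int chunks = 0;
        for (String line : lines) {
            if (line.isEmpty()) continue;               // sentence boundary
            String[] cols = line.split("\\s+");
            if (cols.length != 3)
                throw new IllegalArgumentException("bad line: " + line);
            if (cols[2].startsWith("B-")) chunks++;     // a B- tag opens a chunk
        }
        return chunks;
    }

    public static void main(String[] args) {
        List<String> sample = List.of(
                "He PRP B-NP",
                "reckons VBZ B-VP",
                "the DT B-NP",
                "current JJ I-NP",
                "account NN I-NP",
                "deficit NN I-NP",
                ". . O");
        System.out.println(countChunks(sample));  // prints 3
    }
}
```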
+ </section>
+ <section id="tools.corpora.conll.2000.training">
+ <title>Training</title>
+ <para>
+ We can train a Chunker model using the train.txt file available from the CoNLL-2000 site:
+ <programlisting>
+ <![CDATA[
+bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang en -iterations 500 \
+-data train.txt -model en-chunker.bin]]>
+ </programlisting>
+ <programlisting>
+ <![CDATA[
+Indexing events using cutoff of 5
+
+ Computing event counts... done. 211727 events
+ Indexing... done.
+Sorting and merging events... done. Reduced 211727 events to 197252.
+Done indexing.
+Incorporating indexed data for training...
+done.
+ Number of Event Tokens: 197252
+ Number of Outcomes: 22
+ Number of Predicates: 107838
+...done.
+Computing model parameters...
+Performing 500 iterations.
+ 1: .. loglikelihood=-654457.1455212828 0.2601510435608118
+ 2: .. loglikelihood=-239513.5583724216 0.9260037690044255
+ 3: .. loglikelihood=-141313.1386347238 0.9443387003074715
+ 4: .. loglikelihood=-101083.50853437989 0.954375209585929
+... cut lots of iterations ...
+498: .. loglikelihood=-1710.8874647317095 0.9995040783650645
+499: .. loglikelihood=-1708.0908900815848 0.9995040783650645
+500: .. loglikelihood=-1705.3045902366732 0.9995040783650645
+Writing chunker model ... done (4.019s)
+
+Wrote chunker model to path: .\en-chunker.bin]]>
+ </programlisting>
+ </para>
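The 22 outcomes the trainer reports are the chunk tags of the B-I-O encoding used by CoNLL-2000 (B-NP, I-NP, B-VP, ..., O). To make that encoding concrete, here is a small stdlib-only sketch (hypothetical, not the OpenNLP implementation) that decodes a tag sequence back into bracketed chunks:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of B-I-O decoding: turns parallel token/tag arrays into
// bracketed chunks such as "NP[He]" (hypothetical helper, not OpenNLP API).
public class BioDecode {

    static List<String> decode(String[] tokens, String[] tags) {
        List<String> chunks = new ArrayList<>();
        StringBuilder current = null;
        String type = null;
        for (int i = 0; i < tokens.length; i++) {
            boolean continues = tags[i].startsWith("I-") && current != null
                    && tags[i].substring(2).equals(type);
            if (continues) {
                current.append(' ').append(tokens[i]);  // extend the open chunk
            } else {
                if (current != null) chunks.add(type + "[" + current + "]");
                current = null;
                type = null;
                // B- opens a chunk; a stray I- with no open chunk is treated as B-
                if (tags[i].startsWith("B-") || tags[i].startsWith("I-")) {
                    type = tags[i].substring(2);
                    current = new StringBuilder(tokens[i]);
                }
            }
        }
        if (current != null) chunks.add(type + "[" + current + "]");
        return chunks;
    }

    public static void main(String[] args) {
        String[] toks = {"He", "reckons", "the", "current", "account", "deficit", "."};
        String[] tags = {"B-NP", "B-VP", "B-NP", "I-NP", "I-NP", "I-NP", "O"};
        System.out.println(decode(toks, tags));
        // prints [NP[He], VP[reckons], NP[the current account deficit]]
    }
}
```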
+ </section>
+ <section id="tools.corpora.conll.2000.evaluation">
+ <title>Evaluating</title>
+ <para>
+ We can evaluate the model using the test.txt file available from the CoNLL-2000 site:
+ <programlisting>
+ <![CDATA[
+$ bin/opennlp ChunkerEvaluator -encoding utf8 -model en-chunker.bin -data test.txt]]>
+ </programlisting>
+ <programlisting>
+ <![CDATA[
+Loading Chunker model ... done (0,665s)
+current: 85,8 sent/s avg: 85,8 sent/s total: 86 sent
+current: 88,1 sent/s avg: 87,0 sent/s total: 174 sent
+current: 156,2 sent/s avg: 110,0 sent/s total: 330 sent
+current: 192,2 sent/s avg: 130,5 sent/s total: 522 sent
+current: 167,2 sent/s avg: 137,8 sent/s total: 689 sent
+current: 179,2 sent/s avg: 144,6 sent/s total: 868 sent
+current: 183,2 sent/s avg: 150,3 sent/s total: 1052 sent
+current: 183,2 sent/s avg: 154,4 sent/s total: 1235 sent
+current: 169,2 sent/s avg: 156,0 sent/s total: 1404 sent
+current: 178,2 sent/s avg: 158,2 sent/s total: 1582 sent
+current: 172,2 sent/s avg: 159,4 sent/s total: 1754 sent
+current: 177,2 sent/s avg: 160,9 sent/s total: 1931 sent
+
+
+Average: 161,6 sent/s
+Total: 2013 sent
+Runtime: 12.457s
+
+Precision: 0.9244354736974896
+Recall: 0.9216837162502096
+F-Measure: 0.9230575441395671]]>
+ </programlisting>
+ </para>
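The reported F-measure is the harmonic mean of precision and recall, which can be checked directly against the numbers above (a stdlib-only sketch, not OpenNLP code):

```java
// Checks that the reported F-measure equals the harmonic mean
// F = 2PR / (P + R) of the reported precision and recall.
public class FMeasure {

    static double fMeasure(double precision, double recall) {
        return 2 * precision * recall / (precision + recall);
    }

    public static void main(String[] args) {
        double p = 0.9244354736974896;  // precision from the run above
        double r = 0.9216837162502096;  // recall from the run above
        System.out.printf("%.4f%n", fMeasure(p, r));  // prints 0.9231
    }
}
```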
+ </section>
+ </section>
<section id="tools.corpora.conll.2003">
<title>CONLL 2003</title>
<para>