Posted to commits@opennlp.apache.org by co...@apache.org on 2011/01/13 18:14:05 UTC

svn commit: r1058668 - in /incubator/opennlp/trunk/opennlp-docs/src/docbkx: chunker.xml opennlp.xml

Author: colen
Date: Thu Jan 13 17:14:04 2011
New Revision: 1058668

URL: http://svn.apache.org/viewvc?rev=1058668&view=rev
Log:
OPENNLP-61 Created Chunk tool documentation

Added:
    incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml   (with props)
Modified:
    incubator/opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml

Added: incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml?rev=1058668&view=auto
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml (added)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml Thu Jan 13 17:14:04 2011
@@ -0,0 +1,178 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+<chapter id="tools.chunker">
+
+	<title>Chunker</title>
+
+	<section id="tools.parser.chunking">
+		<title>Chunking</title>
+		<para>
+		Text chunking divides a text into syntactically correlated groups of words, such as noun groups and verb groups, but specifies neither their internal structure nor their role in the main sentence.
+		</para>
+		
+		<section id="tools.parser.chunking.cmdline">
+		<title>Chunker Tool</title>
+		<para>
+		The easiest way to try out the Chunker is the command line tool. The tool is only intended for demonstration and testing.
+		</para> 
+		<para>
+		Download the English maxent chunker model from the website and start the Chunker Tool with this command:
+		</para>
+		<para>
+				<programlisting>
+				<![CDATA[
+bin/opennlp ChunkerME en-chunker.bin]]>
+		</programlisting>
+		The Chunker now reads one POS-tagged sentence per line from stdin.
+		Copy these two sentences to the console: 
+		<programlisting>
+				<![CDATA[
+Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._.
+Rockwell_NNP said_VBD the_DT agreement_NN calls_VBZ for_IN it_PRP to_TO supply_VB 200_CD additional_JJ so-called_JJ shipsets_NNS for_IN the_DT planes_NNS ._.]]>
+		</programlisting>
+		The Chunker will now echo the sentences to the console with their tokens grouped into chunks:
+				<programlisting>
+				<![CDATA[
+[NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ] [NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ] [NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ] [NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._.
+[NP Rockwell_NNP ] [VP said_VBD ] [NP the_DT agreement_NN ] [VP calls_VBZ ] [SBAR for_IN ] [NP it_PRP ] [VP to_TO supply_VB ] [NP 200_CD additional_JJ so-called_JJ shipsets_NNS ] [PP for_IN ] [NP the_DT planes_NNS ] ._.]]>
+		</programlisting>
+		The tag set used by the English POS model is the Penn Treebank tag set. See the link below for a description of the tags.
+		</para>
+		</section>
+		<section id="tools.parser.chunking.api">
+		<title>Chunking API</title>
+		<para>
+			TODO
+		</para>
+		</section>
+	</section>
+	<section id="tools.chunker.training">
+		<title>Chunker Training</title>
+		<para>
+		The pre-trained models might not be available for a desired language, might not detect important chunks, or might not perform well enough outside the news domain.
+		</para>
+		<para>
+		These are the typical reasons to train the chunker on a new corpus, or on a corpus extended with private training data taken from the data to be analyzed.
+		</para>
+		<para>
+		The training data must be converted to the OpenNLP chunker training format, which is based on <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>. The training data consists of three columns separated by spaces. Each word is on a separate line, and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag and the third its chunk tag. The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two kinds of chunk tags: B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk. Here is an example of the file format:
+		</para>
+		<para>
+		Sample sentence of the training data: 
+		<programlisting>
+				<![CDATA[
+He        PRP  B-NP
+reckons   VBZ  B-VP
+the       DT   B-NP
+current   JJ   I-NP
+account   NN   I-NP
+deficit   NN   I-NP
+will      MD   B-VP
+narrow    VB   I-VP
+to        TO   B-PP
+only      RB   B-NP
+#         #    I-NP
+1.8       CD   I-NP
+billion   CD   I-NP
+in        IN   B-PP
+September NNP  B-NP
+.         .    O]]>
+		</programlisting>
+		</para>
+		<section id="tools.chunker.training.tool">
+		<title>Training Tool</title>
+		<para>
+		OpenNLP has a command line tool which is used to train the models available from the model download page on various corpora.
+		</para> 
+		<para>
+		Usage of the tool:
+				<programlisting>
+				<![CDATA[
+$ bin/opennlp ChunkerTrainerME
+Usage: opennlp ChunkerTrainerME -lang language -encoding charset [-iterations num] [-cutoff num] -data trainingData -model model
+-lang language     specifies the language which is being processed.
+-encoding charset  specifies the encoding which should be used for reading and writing text.
+-iterations num    specified the number of training iterations
+-cutoff num        specifies the min number of times a feature must be seen]]>
+		</programlisting>
+		Assume the English chunker model is to be trained from a file called en-chunker.train, which is encoded as UTF-8. The following command will train the chunker and write the model to en-chunker.bin:
+		<programlisting>
+		<![CDATA[
+bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang en -data en-chunker.train -model en-chunker.bin]]>
+		</programlisting>
+		Additionally, it is possible to specify the number of iterations and the cutoff with the -iterations and -cutoff options shown in the usage above.
+		</para>
+		</section>
+	</section>
+	
+	<section id="tools.chunker.evaluation">
+		<title>Chunker Evaluation</title>
+		<para>
+		(only OpenNLP 1.5.1-SNAPSHOT or later)
+		</para>
+		<para>
+		The built-in evaluation can measure the chunker performance, either on a test dataset or via cross validation.
+		</para>
+		<section id="tools.chunker.evaluation.tool">
+		<title>Chunker Evaluation Tool</title>
+		<para>
+		The following command shows how the tool can be run:
+				<programlisting>
+				<![CDATA[
+bin/opennlp ChunkerEvaluator
+Usage: opennlp ChunkerEvaluator [-encoding charsetName] -data data -model model]]>
+		</programlisting>
+		A sample invocation, assuming you have evaluation data named en-chunker.eval and a trained model called en-chunker.bin:
+				<programlisting>
+				<![CDATA[
+bin/opennlp ChunkerEvaluator -encoding UTF-8 -data en-chunker.eval -model en-chunker.bin]]>
+		</programlisting>		
+		Here is a sample output:
+		<programlisting>
+		<![CDATA[
+Precision: 0.9255923572240226
+Recall: 0.9220610430991112
+F-Measure: 0.9238233255623465]]>
+		</programlisting>
+		You can also use the tool to perform 10-fold cross validation of the Chunker.
+		The following command shows how the tool can be run:
+				<programlisting>
+				<![CDATA[
+bin/opennlp ChunkerCrossValidator
+Usage: opennlp ChunkerCrossValidator -lang language -encoding charset [-iterations num] [-cutoff num]
+-lang language     specifies the language which is being processed.
+-encoding charset  specifies the encoding which should be used for reading and writing text.
+-iterations num    specified the number of training iterations
+-cutoff num        specifies the min number of times a feature must be seen
+-data trainingData      training data used for cross validation]]>
+		</programlisting>
+		It is not necessary to pass a model. The tool will automatically split the data to train and evaluate:
+				<programlisting>
+				<![CDATA[
+bin/opennlp ChunkerCrossValidator -lang en -encoding UTF-8 -data en-chunker.cross]]>
+		</programlisting>
+		</para>
+		</section>
+	</section>
+</chapter>
\ No newline at end of file
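
Note on the training format documented in chunker.xml above: the three-column layout (word, POS tag, chunk tag, with an empty line after each sentence) can be read with a few lines of code. The following is a minimal Python sketch and is not part of OpenNLP; the names read_sentences, group_chunks and SAMPLE are hypothetical and chosen only for illustration. It assumes whitespace-separated columns and treats an I- tag that does not continue a chunk of the same type as the start of a new chunk.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str, str]  # (word, POS tag, chunk tag)

# Sample sentence copied from the chunker.xml training-format example.
SAMPLE = """\
He        PRP  B-NP
reckons   VBZ  B-VP
the       DT   B-NP
current   JJ   I-NP
account   NN   I-NP
deficit   NN   I-NP
will      MD   B-VP
narrow    VB   I-VP
to        TO   B-PP
only      RB   B-NP
#         #    I-NP
1.8       CD   I-NP
billion   CD   I-NP
in        IN   B-PP
September NNP  B-NP
.         .    O
"""


def read_sentences(lines: Iterable[str]) -> List[List[Token]]:
    """Split three-column lines into sentences of (word, pos, chunk) triples."""
    sentences: List[List[Token]] = []
    current: List[Token] = []
    for line in lines:
        fields = line.split()
        if not fields:  # an empty line closes the current sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) != 3:
            raise ValueError(f"expected 3 columns, got {len(fields)}: {line!r}")
        current.append((fields[0], fields[1], fields[2]))
    if current:  # the last sentence may lack a trailing empty line
        sentences.append(current)
    return sentences


def group_chunks(sentence: List[Token]) -> List[Tuple[str, List[str]]]:
    """Merge B-X / I-X runs into (chunk type, words) pairs; O tokens get type 'O'."""
    chunks: List[Tuple[str, List[str]]] = []
    for word, _pos, tag in sentence:
        prefix, _, kind = tag.partition("-")
        if prefix == "I" and chunks and chunks[-1][0] == kind:
            chunks[-1][1].append(word)  # continue the open chunk
        else:
            # B-X, O, or an I-X with no matching open chunk starts a new group
            chunks.append((kind or "O", [word]))
    return chunks


if __name__ == "__main__":
    for kind, words in group_chunks(read_sentences(SAMPLE.splitlines())[0]):
        print(kind, " ".join(words))
```

On the sample sentence, group_chunks puts "the current account deficit" into a single NP chunk and leaves the final period as an O token.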

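Note on the sample ChunkerEvaluator output in chunker.xml above: the three reported figures are related if the F-Measure is the usual balanced F1 score, that is, the harmonic mean of precision and recall. That definition is an assumption here, not something the commit states. The short Python sketch below (the function name f_measure is hypothetical and not an OpenNLP API) shows the relationship; the sample precision and recall reproduce the sample F-Measure to several decimal places.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F1 score: the harmonic mean of precision and recall."""
    if precision + recall == 0.0:
        return 0.0  # avoid division by zero when nothing was found or expected
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Precision and recall taken from the sample output in chunker.xml.
    print(f_measure(0.9255923572240226, 0.9220610430991112))
```
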
Propchange: incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml
------------------------------------------------------------------------------
    svn:mime-type = text/plain

Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml?rev=1058668&r1=1058667&r2=1058668&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/opennlp.xml Thu Jan 13 17:14:04 2011
@@ -80,7 +80,7 @@ under the License.
 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./namefinder.xml" />
 	<!--xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./doccat.xml" /-->
 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./postagger.xml" />
-	<!--xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./chunker.xml" /-->
+	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./chunker.xml" />
 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./parser.xml" />
 	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="./corpora.xml" />
 </book>