Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/05/31 13:07:16 UTC
svn commit: r1129623 -
/incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml
Author: joern
Date: Tue May 31 11:07:15 2011
New Revision: 1129623
URL: http://svn.apache.org/viewvc?rev=1129623&view=rev
Log:
OPENNLP-194 Fixed too long lines
Modified:
incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml
Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml?rev=1129623&r1=1129622&r2=1129623&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/chunker.xml Tue May 31 11:07:15 2011
@@ -28,13 +28,15 @@ under the License.
<section id="tools.parser.chunking">
<title>Chunking</title>
<para>
- Text chunking consists of dividing a text in syntactically correlated parts of words, like noun groups, verb groups, but does not specify their internal structure, nor their role in the main sentence.
+ Text chunking consists of dividing a text into syntactically correlated groups of words,
+ like noun groups and verb groups, but it does not specify their internal structure, nor their role in the main sentence.
</para>
<section id="tools.parser.chunking.cmdline">
<title>Chunker Tool</title>
<para>
- The easiest way to try out the Chunker is the command line tool. The tool is only intended for demonstration and testing.
+ The easiest way to try out the Chunker is the command line tool. The tool is only intended
+ for demonstration and testing.
</para>
<para>
Download the English maxent chunker model from the website and start the Chunker Tool with this command:
@@ -48,16 +50,25 @@ bin/opennlp ChunkerME en-chunker.bin]]>
Copy these two sentences to the console:
<programlisting>
<![CDATA[
-Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._.
-Rockwell_NNP said_VBD the_DT agreement_NN calls_VBZ for_IN it_PRP to_TO supply_VB 200_CD additional_JJ so-called_JJ shipsets_NNS for_IN the_DT planes_NNS ._.]]>
+Rockwell_NNP International_NNP Corp._NNP 's_POS Tulsa_NNP unit_NN said_VBD it_PRP signed_VBD
+ a_DT tentative_JJ agreement_NN extending_VBG its_PRP$ contract_NN with_IN Boeing_NNP Co._NNP
+ to_TO provide_VB structural_JJ parts_NNS for_IN Boeing_NNP 's_POS 747_CD jetliners_NNS ._.
+Rockwell_NNP said_VBD the_DT agreement_NN calls_VBZ for_IN it_PRP to_TO supply_VB 200_CD
+ additional_JJ so-called_JJ shipsets_NNS for_IN the_DT planes_NNS ._.]]>
</programlisting>
The Chunker will now echo the sentences' grouped tokens to the console:
<programlisting>
<![CDATA[
-[NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ] [NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ] [NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ] [NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._.
-[NP Rockwell_NNP ] [VP said_VBD ] [NP the_DT agreement_NN ] [VP calls_VBZ ] [SBAR for_IN ] [NP it_PRP ] [VP to_TO supply_VB ] [NP 200_CD additional_JJ so-called_JJ shipsets_NNS ] [PP for_IN ] [NP the_DT planes_NNS ] ._.]]>
+[NP Rockwell_NNP International_NNP Corp._NNP ] [NP 's_POS Tulsa_NNP unit_NN ] [VP said_VBD ]
+ [NP it_PRP ] [VP signed_VBD ] [NP a_DT tentative_JJ agreement_NN ] [VP extending_VBG ]
+ [NP its_PRP$ contract_NN ] [PP with_IN ] [NP Boeing_NNP Co._NNP ] [VP to_TO provide_VB ]
+ [NP structural_JJ parts_NNS ] [PP for_IN ] [NP Boeing_NNP ] [NP 's_POS 747_CD jetliners_NNS ] ._.
+[NP Rockwell_NNP ] [VP said_VBD ] [NP the_DT agreement_NN ] [VP calls_VBZ ] [SBAR for_IN ]
+ [NP it_PRP ] [VP to_TO supply_VB ] [NP 200_CD additional_JJ so-called_JJ shipsets_NNS ]
+ [PP for_IN ] [NP the_DT planes_NNS ] ._.]]>
</programlisting>
- The tag set used by the english pos model is the Penn Treebank tag set. See the link below for a description of the tags.
+ The tag set used by the English POS model is the Penn Treebank tag set.
+ See the link below for a description of the tags.
</para>
</section>
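The bracketed output shown above is easy to post-process. As an illustration only (this helper is not part of the OpenNLP distribution), a small Python sketch that splits one output line into (chunk type, words) pairs, treating unbracketed tokens such as the final punctuation as type "O":

```python
import re

# Matches either a bracketed chunk "[TYPE tok_TAG ... ]" or a bare token.
CHUNK_RE = re.compile(r'\[(?P<type>\w+) (?P<body>.*?) \]|(?P<tok>\S+)')

def parse_chunks(line):
    """Parse one line of ChunkerME console output into (chunk_type, words) pairs.

    Words are recovered by stripping the trailing "_TAG" POS suffix from
    each token; tokens outside any bracket get the chunk type "O".
    """
    chunks = []
    for m in CHUNK_RE.finditer(line):
        if m.group('type'):
            words = [t.rsplit('_', 1)[0] for t in m.group('body').split()]
            chunks.append((m.group('type'), words))
        else:
            chunks.append(('O', [m.group('tok').rsplit('_', 1)[0]]))
    return chunks
```

For example, `parse_chunks("[NP Rockwell_NNP ] [VP said_VBD ] ._.")` yields `[("NP", ["Rockwell"]), ("VP", ["said"]), ("O", ["."])]`.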
<section id="tools.parser.chunking.api">
@@ -70,13 +81,23 @@ Rockwell_NNP said_VBD the_DT agreement_N
<section id="tools.chunker.training">
<title>Chunker Training</title>
<para>
- The pre-trained models might not be available for a desired language, can not detect important entities or the performance is not good enough outside the news domain.
+ The pre-trained models might not be available for a desired language, might not
+ detect important entities, or might not perform well enough outside the news domain.
</para>
<para>
- These are the typical reason to do custom training of the chunker on a new corpus or on a corpus which is extended by private training data taken from the data which should be analyzed.
+ These are the typical reasons to do custom training of the chunker on a new
+ corpus, or on a corpus which is extended by private training data taken from the data which should be analyzed.
</para>
<para>
- The training data must be converted to the OpenNLP chunker training format, that is based on <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>: The train data consist of three columns separated by spaces. Each word has been put on a separate line and there is an empty line after each sentence. The first column contains the current word, the second its part-of-speech tag and the third its chunk tag. The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words and I-VP for verb phrase words. Most chunk types have two types of chunk tags, B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk. Here is an example of the file format:
+ The training data must be converted to the OpenNLP chunker training format,
+ which is based on <ulink url="http://www.cnts.ua.ac.be/conll2000/chunking">CoNLL2000</ulink>:
+ The training data consists of three columns separated by spaces. Each word has been put on a
+ separate line and there is an empty line after each sentence. The first column contains
+ the current word, the second its part-of-speech tag and the third its chunk tag.
+ The chunk tags contain the name of the chunk type, for example I-NP for noun phrase words
+ and I-VP for verb phrase words. Most chunk types have two kinds of chunk tags,
+ B-CHUNK for the first word of the chunk and I-CHUNK for each other word in the chunk.
+ Here is an example of the file format:
</para>
<para>
Sample sentence of the training data:
@@ -103,25 +124,30 @@ September NNP B-NP
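The B-CHUNK/I-CHUNK scheme described above can be illustrated with a short Python sketch (an illustration only, not OpenNLP code) that groups the three-column rows back into chunks, one list of (type, words) pairs per sentence:

```python
def read_conll_chunks(rows):
    """Group CoNLL2000-style rows ("word POS chunktag") into chunks.

    A B-XX tag starts a new chunk of type XX, I-XX continues the current
    chunk, and O marks a token outside any chunk. A blank row ends the
    current sentence. Returns a list of sentences, each a list of
    (chunk_type, words) pairs.
    """
    sentences, chunks, current = [], [], None
    for row in rows:
        row = row.strip()
        if not row:                         # blank line: sentence boundary
            if current:
                chunks.append(current)
                current = None
            if chunks:
                sentences.append(chunks)
                chunks = []
            continue
        word, pos, tag = row.split()
        if tag == 'O':
            if current:
                chunks.append(current)
                current = None
        else:
            ctype = tag[2:]
            if tag.startswith('B-') or current is None or current[0] != ctype:
                if current:                 # close the previous chunk
                    chunks.append(current)
                current = (ctype, [word])
            else:                           # I- tag continuing the chunk
                current[1].append(word)
    if current:                             # flush trailing chunk/sentence
        chunks.append(current)
    if chunks:
        sentences.append(chunks)
    return sentences
```

Feeding it the sample rows from the training format above reconstructs the noun and verb groups exactly as the chunker would bracket them.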
<section id="tools.chunker.training.tool">
<title>Training Tool</title>
<para>
- OpenNLP has a command line tool which is used to train the models available from the model download page on various corpora.
+ OpenNLP has a command line tool which is used to train the models available from the
+ model download page on various corpora.
</para>
<para>
Usage of the tool:
<programlisting>
<![CDATA[
$ bin/opennlp ChunkerTrainerME
-Usage: opennlp ChunkerTrainerME-lang language -encoding charset [-iterations num] [-cutoff num] -data trainingData -model model
+Usage: opennlp ChunkerTrainerME -lang language -encoding charset [-iterations num] \
+[-cutoff num] -data trainingData -model model
-lang language specifies the language which is being processed.
-encoding charset specifies the encoding which should be used for reading and writing text.
-iterations num specifies the number of training iterations
-cutoff num specifies the min number of times a feature must be seen]]>
</programlisting>
- Its now assumed that the english chunker model should be trained from a file called en-chunker.train which is encoded as UTF-8. The following command will train the name finder and write the model to en-chunker.bin:
+ It is now assumed that the English chunker model should be trained from a file called
+ en-chunker.train which is encoded as UTF-8. The following command will train the
+ chunker and write the model to en-chunker.bin:
<programlisting>
<![CDATA[
bin/opennlp ChunkerTrainerME -encoding UTF-8 -lang en -data en-chunker.train -model en-chunker.bin]]>
</programlisting>
- Additionally its possible to specify the number of iterations, the cutoff and to overwrite all types in the training data with a single type.
+ Additionally, it is possible to specify the number of iterations, the cutoff, and to overwrite
+ all types in the training data with a single type.
</para>
</section>
</section>
@@ -129,7 +155,8 @@ bin/opennlp ChunkerTrainerME -encoding U
<section id="tools.chunker.evaluation">
<title>Chunker Evaluation</title>
<para>
- The built in evaluation can measure the chunker performance. The performance is either measured on a test dataset or via cross validation.
+ The built-in evaluation can measure the chunker performance. The performance is either
+ measured on a test dataset or via cross validation.
</para>
<section id="tools.chunker.evaluation.tool">
<title>Chunker Evaluation Tool</title>
@@ -140,7 +167,8 @@ bin/opennlp ChunkerTrainerME -encoding U
bin/opennlp ChunkerEvaluator
Usage: opennlp ChunkerEvaluator [-encoding charsetName] -data data -model model]]>
</programlisting>
- A sample of the command considering you have a data sample named en-chunker.eval and you trainned a model called en-chunker.bin:
+ A sample of the command, assuming you have a data sample named en-chunker.eval
+ and trained a model called en-chunker.bin:
<programlisting>
<![CDATA[
bin/opennlp ChunkerEvaluator -lang en -encoding UTF-8 -data en-chunker.eval -model en-chunker.bin]]>
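The evaluator reports precision, recall, and F-measure at the chunk level. The underlying formulas can be sketched as follows (a minimal illustration, not OpenNLP's implementation), where each chunk is identified by its sentence index, span, and type:

```python
def chunk_prf(reference, predicted):
    """Chunk-level precision, recall, and F-measure.

    Each argument is an iterable of (sentence_index, start, end, chunk_type)
    tuples; a predicted chunk counts as correct only if the identical tuple
    appears in the reference (exact span and type match).
    """
    ref, pred = set(reference), set(predicted)
    tp = len(ref & pred)                                  # true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(ref) if ref else 0.0
    f = (2 * precision * recall / (precision + recall)) if (precision + recall) else 0.0
    return precision, recall, f
```

For instance, if one of two predicted chunks matches the reference exactly, precision, recall, and F-measure all come out to 0.5.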