Posted to commits@opennlp.apache.org by jo...@apache.org on 2011/11/30 11:29:10 UTC

svn commit: r1208368 - /incubator/opennlp/trunk/opennlp-docs/src/docbkx/introduction.xml

Author: joern
Date: Wed Nov 30 10:29:10 2011
New Revision: 1208368

URL: http://svn.apache.org/viewvc?rev=1208368&view=rev
Log:
OPENNLP-404 Now explains generic usage of OpenNLP. Thanks to Aliaksandr Autayeu for providing a patch.

Modified:
    incubator/opennlp/trunk/opennlp-docs/src/docbkx/introduction.xml

Modified: incubator/opennlp/trunk/opennlp-docs/src/docbkx/introduction.xml
URL: http://svn.apache.org/viewvc/incubator/opennlp/trunk/opennlp-docs/src/docbkx/introduction.xml?rev=1208368&r1=1208367&r2=1208368&view=diff
==============================================================================
--- incubator/opennlp/trunk/opennlp-docs/src/docbkx/introduction.xml (original)
+++ incubator/opennlp/trunk/opennlp-docs/src/docbkx/introduction.xml Wed Nov 30 10:29:10 2011
@@ -23,17 +23,280 @@ under the License.
 
 <chapter id="opennlp">
 <title>Introduction</title>
-<para>
-The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
-It supports the most common NLP tasks, such as tokenization, sentence segmentation,
-part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
-These tasks are usually required to build more advanced text processing services.
-OpenNLP also included maximum entropy and perceptron based machine learning.
-</para>
-
-<para>
-The goal of the OpenNLP project will be to create a mature toolkit for the abovementioned tasks.
-An additional goal is to provide a large number of pre-built models for a variety of languages, as
-well as the annotated text resources that those models are derived from.
-</para>
+    <section id="intro.description">
+        <title>Description</title>
+        <para>
+        The Apache OpenNLP library is a machine learning based toolkit for the processing of natural language text.
+        It supports the most common NLP tasks, such as tokenization, sentence segmentation,
+        part-of-speech tagging, named entity extraction, chunking, parsing, and coreference resolution.
+        These tasks are usually required to build more advanced text processing services.
+        OpenNLP also includes maximum entropy and perceptron-based machine learning.
+        </para>
+
+        <para>
+        The goal of the OpenNLP project is to create a mature toolkit for the above-mentioned tasks.
+        An additional goal is to provide a large number of pre-built models for a variety of languages, as
+        well as the annotated text resources that those models are derived from.
+        </para>
+    </section>
+
+    <section id="intro.general.library.structure">
+        <title>General Library Structure</title>
+        <para>The Apache OpenNLP library contains several components, enabling one to build
+            a full natural language processing pipeline. These components
+            include a sentence detector, tokenizer,
+            name finder, document categorizer, part-of-speech tagger, chunker, parser,
+            and coreference resolution. Each component contains parts that let one execute the
+            respective natural language processing task, train a model, and often also evaluate a
+            model. Each of these facilities is accessible via its application program
+            interface (API). In addition, a command line interface (CLI) is provided for convenient
+            experimentation and training.
+        </para>
+    </section>
+
+    <section id="intro.api">
+        <title>Application Program Interface (API). Generic Example</title>
+        <para>
+            OpenNLP components have similar APIs. Normally, to execute a task,
+            one should provide a model and an input.
+        </para>
+        <para>
+            A model is usually loaded by passing a FileInputStream for the model file to the
+            constructor of the model class:
+            <programlisting language="java">
+                    <![CDATA[
+InputStream modelIn = new FileInputStream("lang-model-name.bin");
+SomeModel model = null;
+try {
+  model = new SomeModel(modelIn);
+}
+catch (IOException e) {
+  //handle the exception
+}
+finally {
+  if (null != modelIn) {
+    try {
+      modelIn.close();
+    }
+    catch (IOException e) {
+    }
+  }
+}]]>
+            </programlisting>
+        </para>
+        <para>After the model is loaded, the tool itself can be instantiated:
+        <programlisting language="java">
+                <![CDATA[
+ToolName toolName = new ToolName(model);]]>
+        </programlisting></para>
+        <para>After the tool is instantiated, the processing task can be executed. The input and
+        output formats are specific to the tool, but often the output is an array of String,
+        and the input is a String or an array of String.
+        <programlisting language="java">
+                <![CDATA[
+String[] output = toolName.executeTask("This is a sample text.");]]>
+        </programlisting></para>
+    </section>
+
+    <section id="intro.cli">
+        <title>Command line interface (CLI)</title>
+        <section id="intro.cli.description">
+            <title>Description</title>
+            <para>
+                OpenNLP provides a command line script that serves as a single entry point to all
+                included tools. The script is located in the bin directory of the OpenNLP binary
+                distribution. Versions are included for Windows (opennlp.bat) and for Linux or
+                compatible systems (opennlp).
+            </para>
+        </section>
+
+        <section id="intro.cli.setup">
+            <title>Setting up</title>
+            <para>
+                The OpenNLP script uses the JAVA_CMD and JAVA_HOME variables to determine which
+                command to use to start the Java virtual machine.
+            </para>
+            <para>
+                The OpenNLP script uses the OPENNLP_HOME variable to determine the location of the
+                binary distribution of OpenNLP. It is recommended to point this variable to the binary
+                distribution of the current OpenNLP version and to update the PATH variable to include
+                $OPENNLP_HOME/bin or %OPENNLP_HOME%\bin.
+            </para>
+            <para>
+                This configuration allows OpenNLP to be called conveniently. The examples below
+                assume that this configuration is in place.
+            </para>
+        </section>
+
+        <section id="intro.cli.generic">
+            <title>Generic Example</title>
+
+            <para>
+                Apache OpenNLP provides a common command line script to access all its tools:
+                <screen>
+                <![CDATA[
+$ opennlp]]>
+                 </screen>
+                This script prints the current version of the library and lists all available tools:
+                <screen>
+                <![CDATA[
+OpenNLP <VERSION>. Usage: opennlp TOOL
+where TOOL is one of:
+  Doccat                            learnable document categorizer
+  DoccatTrainer                     trainer for the learnable document categorizer
+  DoccatConverter                   converts leipzig data format to native OpenNLP format
+  DictionaryBuilder                 builds a new dictionary
+  SimpleTokenizer                   character class tokenizer
+  TokenizerME                       learnable tokenizer
+  TokenizerTrainer                  trainer for the learnable tokenizer
+  TokenizerMEEvaluator              evaluator for the learnable tokenizer
+  TokenizerCrossValidator           K-fold cross validator for the learnable tokenizer
+  TokenizerConverter                converts foreign data formats (namefinder,conllx,pos) to native OpenNLP format
+  DictionaryDetokenizer
+  SentenceDetector                  learnable sentence detector
+  SentenceDetectorTrainer           trainer for the learnable sentence detector
+  SentenceDetectorEvaluator         evaluator for the learnable sentence detector
+  SentenceDetectorCrossValidator    K-fold cross validator for the learnable sentence detector
+  SentenceDetectorConverter         converts foreign data formats (namefinder,conllx,pos) to native OpenNLP format
+  TokenNameFinder                   learnable name finder
+  TokenNameFinderTrainer            trainer for the learnable name finder
+  TokenNameFinderEvaluator          Measures the performance of the NameFinder model with the reference data
+  TokenNameFinderCrossValidator     K-fold cross validator for the learnable Name Finder
+  TokenNameFinderConverter          converts foreign data formats (bionlp2004,conll03,conll02,ad) to native OpenNLP format
+  CensusDictionaryCreator           Converts 1990 US Census names into a dictionary
+  POSTagger                         learnable part of speech tagger
+  POSTaggerTrainer                  trains a model for the part-of-speech tagger
+  POSTaggerEvaluator                Measures the performance of the POS tagger model with the reference data
+  POSTaggerCrossValidator           K-fold cross validator for the learnable POS tagger
+  POSTaggerConverter                converts conllx data format to native OpenNLP format
+  ChunkerME                         learnable chunker
+  ChunkerTrainerME                  trainer for the learnable chunker
+  ChunkerEvaluator                  Measures the performance of the Chunker model with the reference data
+  ChunkerCrossValidator             K-fold cross validator for the chunker
+  ChunkerConverter                  converts ad data format to native OpenNLP format
+  Parser                            performs full syntactic parsing
+  ParserTrainer                     trains the learnable parser
+  BuildModelUpdater                 trains and updates the build model in a parser model
+  CheckModelUpdater                 trains and updates the check model in a parser model
+  TaggerModelReplacer               replaces the tagger model in a parser model
+All tools print help when invoked with help parameter
+Example: opennlp SimpleTokenizer help
+]]>
+                </screen>
+            </para>
+            <para>OpenNLP tools share a similar command line structure and options. To discover a
+                tool's options, run it with no parameters:
+                <screen>
+                <![CDATA[
+$ opennlp ToolName]]>
+                 </screen>
+                The tool will output two blocks of help.
+            </para>
+            <para>
+                The first block describes the general structure of the tool's command line:
+                <screen>
+                <![CDATA[
+Usage: opennlp TokenizerTrainer[.namefinder|.conllx|.pos] [-abbDict path] ...  -model modelFile ...]]>
+                </screen>
+                The general structure of this tool's command line includes the obligatory tool name
+                (TokenizerTrainer), the optional format parameters ([.namefinder|.conllx|.pos]),
+                the optional parameters ([-abbDict path] ...), and the obligatory parameters
+                (-model modelFile ...).
+            </para>
+            <para>
+                The format parameters enable direct processing of non-native data without conversion.
+                Each format might have its own parameters, which are displayed if the tool is
+                executed without parameters or with the help parameter:
+                <screen>
+                <![CDATA[
+$ opennlp TokenizerTrainer.conllx help]]>
+                </screen>
+                <screen>
+                <![CDATA[
+Usage: opennlp TokenizerTrainer.conllx [-abbDict path] [-alphaNumOpt isAlphaNumOpt] ...
+
+Arguments description:
+        -abbDict path
+                abbreviation dictionary in XML format.
+        ...]]>
+                </screen>
+                To switch the tool to a specific format, add a dot and the format name after
+                the tool name:
+                <screen>
+                <![CDATA[
+$ opennlp TokenizerTrainer.conllx -model en-pos.bin ...]]>
+                </screen>
+            </para>
+            <para>
+                The second block of the help message describes the individual arguments:
+                <screen>
+                <![CDATA[
+Arguments description:
+        -type maxent|perceptron|perceptron_sequence
+                The type of the token name finder model. One of maxent|perceptron|perceptron_sequence.
+        -dict dictionaryPath
+                The XML tag dictionary file
+        ...]]>
+                </screen>
+            </para>
+            <para>
+                Most processing tools require at least a model:
+                <screen>
+                <![CDATA[
+$ opennlp ToolName lang-model-name.bin]]>
+                 </screen>
+                When a tool is executed this way, the model is loaded and the tool waits for
+                input on standard input. The input is processed and the result is printed to
+                standard output.
+            </para>
+            <para>An alternative, and in fact the most commonly used, way is to use console input
+                and output redirection to provide input and output files as well:
+                <screen>
+            <![CDATA[
+$ opennlp ToolName lang-model-name.bin < input.txt > output.txt]]>
+                </screen>
+            </para>
+            <para>
+                Most model training tools require a model name first,
+                then optionally some training options (such as model type or number of iterations),
+                and then the data.
+            </para>
+            <para>
+                A model name is just a file name.
+            </para>
+            <para>
+                Training options often include the number of iterations, the cutoff,
+                an abbreviation dictionary, or other settings. Sometimes these options can be
+                provided via a training options file. In that case the options given on the command
+                line are ignored and the ones from the file are used.
+            </para>
+            <para>
+                For the data, one has to specify its location (a file name) and often the
+                language and encoding.
+            </para>
+            <para>
+                A generic example of a command line to launch a tool trainer might be:
+                <screen>
+                <![CDATA[
+$ opennlp ToolNameTrainer -model en-model-name.bin -lang en -data input.train -encoding UTF-8]]>
+                 </screen>
+                or with a format:
+                <screen>
+                <![CDATA[
+$ opennlp ToolNameTrainer.conll03 -model en-model-name.bin -lang en -data input.train \
+                                  -types per -encoding UTF-8]]>
+                 </screen>
+            </para>
+            <para>Most tools for model evaluation are similar to those for task execution, and
+                require a model name first, then optionally some evaluation options (such
+                as whether to print misclassified samples), and then the test data. A generic
+                example of a command line to launch an evaluation tool might be:
+                <screen>
+                <![CDATA[
+$ opennlp ToolNameEvaluator -model en-model-name.bin -lang en -data input.test -encoding UTF-8]]>
+                 </screen>
+            </para>
+        </section>
+    </section>
+
 </chapter>
\ No newline at end of file