Posted to commits@uima.apache.org by sc...@apache.org on 2008/08/28 23:28:16 UTC
svn commit: r689997 [21/32] - in /incubator/uima/uimaj/trunk/uima-docbooks:
./ src/ src/docbook/overview_and_setup/ src/docbook/references/
src/docbook/tools/ src/docbook/tutorials_and_users_guides/
src/docbook/uima/organization/ src/olink/references/
Modified: incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/tutorials_and_users_guides/annotator_analysis_engine_guide.xml
URL: http://svn.apache.org/viewvc/incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/tutorials_and_users_guides/annotator_analysis_engine_guide.xml?rev=689997&r1=689996&r2=689997&view=diff
==============================================================================
--- incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/tutorials_and_users_guides/annotator_analysis_engine_guide.xml (original)
+++ incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/tutorials_and_users_guides/annotator_analysis_engine_guide.xml Thu Aug 28 14:28:14 2008
@@ -1,2592 +1,2592 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
-"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
-<!ENTITY imgroot "../images/tutorials_and_users_guides/tug.aae/">
-<!ENTITY % uimaents SYSTEM "../entities.ent">
-%uimaents;
-]>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<chapter id="ugr.tug.aae">
- <title>Annotator and Analysis Engine Developer's Guide</title>
- <titleabbrev>Annotator &amp; AE Developer's Guide</titleabbrev>
-
- <para>This chapter describes how to develop UIMA <emphasis>type systems</emphasis>,
- <emphasis>Annotators</emphasis> and <emphasis>Analysis Engines</emphasis> using
- the UIMA SDK. It is helpful to read the UIMA Conceptual Overview chapter first for a review of
- these concepts.</para>
-
- <para>An <emphasis>Analysis Engine (AE)</emphasis> is a program that analyzes artifacts
- (e.g. documents) and infers information from them.</para>
-
- <para>Analysis Engines are constructed from building blocks called
- <emphasis>Annotators</emphasis>. An annotator is a component that contains analysis
- logic. Annotators analyze an artifact (for example, a text document) and create
- additional data (metadata) about that artifact. It is a goal of UIMA that annotators need
- not be concerned with anything other than their analysis logic – for example the
- details of their deployment or their interaction with other annotators.</para>
-
- <para>An Analysis Engine (AE) may contain a single annotator (this is referred to as a
- <emphasis>Primitive AE</emphasis>), or it may be a composition of others and therefore
- contain multiple annotators (this is referred to as an <emphasis>Aggregate
- AE</emphasis>). Primitive and aggregate AEs implement the same interface and can be used
- interchangeably by applications.</para>
-
- <para>Annotators produce their analysis results in the form of typed <emphasis>Feature
- Structures</emphasis>, which are simply data structures that have a type and a set of
- (attribute, value) pairs. An <emphasis>annotation</emphasis> is a particular type of
- Feature Structure that is attached to a region of the artifact being analyzed (a span of
- text in a document, for example).</para>
-
- <para>For example, an annotator may produce an Annotation over the span of text
- <literal>President Bush</literal>, where the type of the Annotation is
- <literal>Person</literal> and the attribute <literal>fullName</literal> has the
- value <literal>George W. Bush</literal>, and its position in the artifact is character
- position 12 through character position 26.</para>
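The example above can be sketched in plain Java. This is only an illustration of the idea of a typed structure with (attribute, value) pairs attached to a span; the class `SpanSketch` and the sample sentence are hypothetical and are not part of the UIMA API. It assumes the end offset is exclusive, which is consistent with the 14-character span from 12 to 26.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for an annotation; NOT the UIMA JCas API.
class SpanSketch {
    final String type;                 // e.g. "Person"
    final int begin;                   // offset of the first covered character
    final int end;                     // offset one past the last covered character (assumed exclusive)
    final Map<String, String> features = new LinkedHashMap<>();

    SpanSketch(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    // Returns the region of the artifact that this structure is attached to.
    String coveredText(String documentText) {
        return documentText.substring(begin, end);
    }

    public static void main(String[] args) {
        // Sample sentence chosen so that the span starts at offset 12.
        String text = "On Tuesday, President Bush spoke.";
        SpanSketch person = new SpanSketch("Person", 12, 26);
        person.features.put("fullName", "George W. Bush");
        System.out.println(person.type + " over '" + person.coveredText(text)
                + "' with features " + person.features);
    }
}
```

In the real framework the JCas class generated for the type plays this role, with typed getters and setters instead of a generic map.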
-
- <para>It is also possible for annotators to record information associated with the entire
- document rather than a particular span (these are considered Feature Structures but not
- Annotations).</para>
-
- <para>All feature structures, including annotations, are represented in the UIMA
- <emphasis>Common Analysis Structure (CAS)</emphasis>. The CAS is the central data
- structure through which all UIMA components communicate. Included with the UIMA SDK is an
- easy-to-use, native Java interface to the CAS called the <emphasis>JCas</emphasis>.
- The JCas represents each feature structure as a Java object; the example feature
- structure from the previous paragraph would be an instance of a Java class Person with
- getFullName() and setFullName() methods. Though the examples in this guide all use the
- JCas, it is also possible to directly access the underlying CAS system; for more
- information see <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>
-
- <para>The remainder of this chapter will refer to the analysis of text documents and the
- creation of annotations that are attached to spans of text in those documents. Keep in mind
- that the CAS can represent arbitrary types of feature structures, and feature structures
- can refer to other feature structures. For example, you can use the CAS to represent a parse
- tree for a document. Also, the artifact that you are analyzing need not be a text
- document.</para>
-
- <para>This guide is organized as follows:</para>
-
- <itemizedlist>
- <listitem>
- <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.getting_started"/></emphasis> is a
- tutorial with step-by-step instructions for how to develop and test a simple UIMA annotator.</para>
- </listitem>
- <listitem>
- <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.configuration_logging"/>
- </emphasis> discusses how to make your UIMA annotator configurable, and how it can write messages to the UIMA
- log file.</para>
- </listitem>
- <listitem>
- <para> <emphasis role="bold-italic"><xref linkend="ugr.tug.aae.building_aggregates"/></emphasis>
- describes how annotators can be combined into aggregate analysis engines. It also describes how one
- annotator can make use of the analysis results produced by an annotator that has run previously.</para>
- </listitem>
- <listitem>
- <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.other_examples"/></emphasis>
- describes several other examples you may find interesting, including</para>
-
- <itemizedlist spacing="compact">
- <listitem>
- <para>SimpleTokenAndSentenceAnnotator
- – a simple tokenizer and sentence annotator.</para>
- </listitem>
-
- <listitem>
- <para>PersonTitleDBWriterCasConsumer – a sample CAS Consumer which populates a relational
- database with some annotations. It uses JDBC and in this example, hooks up with the Open Source Apache
- Derby database. </para>
- </listitem>
- </itemizedlist>
- </listitem>
- <listitem>
- <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.additional_topics"/></emphasis>
- describes additional features of the UIMA SDK that may help you in building your own annotators and analysis
- engines.</para>
- </listitem>
- <listitem>
- <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.common_pitfalls"/> </emphasis>
- contains some useful guidelines to help you ensure that your annotators will work correctly in any UIMA
- application.</para>
- </listitem>
- </itemizedlist>
-
- <para>This guide does not discuss how to build UIMA Applications, which are programs that
- use Analysis Engines, along with other components, e.g. a search engine, document store,
- and user interface, to deliver a complete package of functionality to an end-user. For
- information on application development, see <olink
- targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.application"
- xrefstyle="select: label quotedtitle"/>
- .</para>
-
- <section id="ugr.tug.aae.getting_started">
- <title>Getting Started</title>
-
- <para>This section is a step-by-step tutorial that will get you started developing UIMA
- annotators. All of the files referred to by the examples in this chapter are in the
- <literal>examples</literal> directory of the UIMA SDK. This directory is designed to
- be imported into your Eclipse workspace; see <olink
- targetdoc="&uima_docs_overview;"
- targetptr="ugr.ovv.eclipse_setup.example_code"/> for instructions on how to do
- this.
- See <olink targetdoc="&uima_docs_overview;"
- targetptr="ugr.ovv.eclipse_setup.linking_uima_javadocs"/> for how to attach the UIMA
- Javadocs to the jar files.
- Also you may wish to refer to the UIMA SDK Javadocs located in the <ulink
- url="file:../../api/index.html">docs/api</ulink> directory.</para>
-
- <note><para>In Eclipse 3.1, if you highlight a UIMA class or method defined in the UIMA SDK
- Javadocs, you can conveniently have Eclipse open the corresponding Javadoc for that
- class or method in a browser, by pressing Shift + F2.</para></note>
- <note><para>If you downloaded the source distribution for UIMA, you can attach that as
- well to the library Jar files; for information on how to do this, see
- <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.javadocs"/>.</para></note>
-
- <para>The example annotator that we are going to walk through will detect room numbers for
- rooms where the room numbering scheme follows some simple conventions. In our example,
- there are two kinds of patterns we want to find; here are some examples, together with
- their corresponding regular expression patterns:
- <variablelist>
- <varlistentry>
- <term>Yorktown patterns:</term>
- <listitem><para>20-001, 31-206, 04-123 (Regular Expression Pattern:
- ##-[0-2]##)</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>Hawthorne patterns:</term>
- <listitem><para>GN-K35, 1S-L07, 4N-B21 (Regular Expression Pattern:
- [G1-4][NS]-[A-Z]##)</para></listitem>
- </varlistentry>
- </variablelist> </para>
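The informal patterns above can be tried out with `java.util.regex`. The sketch below is a direct reading of them, assuming that `#` stands for a single digit; the class name `RoomPatternSketch` and the sample sentence are made up for illustration, and the tutorial's own annotator uses a slightly stricter Yorktown expression.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for experimenting with the two room-number patterns.
class RoomPatternSketch {
    // "##-[0-2]##" read literally, with '#' taken to mean one digit (an assumption).
    static final Pattern YORKTOWN = Pattern.compile("\\b\\d\\d-[0-2]\\d\\d\\b");
    // "[G1-4][NS]-[A-Z]##": G or 1-4, then N or S, a hyphen, a capital letter, two digits.
    static final Pattern HAWTHORNE = Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");

    // Collects every non-overlapping match of the pattern in the text.
    static List<String> findAll(Pattern pattern, String text) {
        List<String> hits = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            hits.add(matcher.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        String text = "Rooms 20-001, 31-206 and 04-123 are booked; also GN-K35, 1S-L07, 4N-B21.";
        System.out.println("Yorktown:  " + findAll(YORKTOWN, text));
        System.out.println("Hawthorne: " + findAll(HAWTHORNE, text));
    }
}
```

The `\b` word boundaries keep the expressions from matching inside longer tokens such as `120-0011`.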
-
- <para>There are several steps to develop and test a simple UIMA annotator.</para>
-
- <orderedlist spacing="compact"><listitem><para>Define the CAS types that the
- annotator will use.</para></listitem>
-
- <listitem><para>Generate the Java classes for these types.</para></listitem>
-
- <listitem><para>Write the actual annotator Java code.</para></listitem>
-
- <listitem><para>Create the Analysis Engine descriptor.</para></listitem>
-
- <listitem><para>Test the annotator. </para></listitem></orderedlist>
-
- <para>These steps are discussed in the next sections.</para>
-
- <section id="ugr.tug.aae.defining_types">
- <title>Defining Types</title>
-
- <para>The first step in developing an annotator is to define the CAS Feature Structure
- types that it creates. This is done in an XML file called a <emphasis>Type System
- Descriptor</emphasis>. UIMA defines basic primitive types such as
- Boolean, Byte, Short, Integer, Long, Float, and Double, as well as Arrays of these primitive
- types. UIMA also defines the built-in types <literal>TOP</literal>, which is the root
- of the type system, analogous to Object in Java; <literal>FSArray</literal>, which is
- an array of Feature Structures (i.e. an array of instances of TOP); and
- <literal>Annotation</literal>, which we will discuss in more detail in this section.</para>
-
- <para>UIMA includes an Eclipse plug-in that will help you edit Type System
- Descriptors, so if you are using Eclipse you will not need to worry about the details of
- the XML syntax. See <olink targetdoc="&uima_docs_overview;"
- targetptr="ugr.ovv.eclipse_setup"/> for instructions on setting up Eclipse and
- installing the plugin.</para>
-
- <para>The Type System Descriptor for our annotator is located in the file
- <literal>descriptors/tutorial/ex1/TutorialTypeSystem.xml</literal>. (This
- and all other examples are located in the <literal>examples</literal> directory of
- the installation of the UIMA SDK, which can be imported into an Eclipse project for
- your convenience, as described in <olink targetdoc="&uima_docs_overview;"
- targetptr="ugr.ovv.eclipse_setup.example_code"/>.)</para>
-
- <para>In Eclipse, expand the <literal>uimaj-examples</literal> project in the
- Package Explorer view, and browse to the file
- <literal>descriptors/tutorial/ex1/TutorialTypeSystem.xml</literal>.
- Right-click on the file in the navigator and select Open With → Component
- Descriptor Editor. Once the editor opens, click on the <quote>Type System</quote>
- tab at the bottom of the editor window. You should see a view such as the
- following:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata scale="100" format="JPG" fileref="&imgroot;image002.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of editor for Type System Definitions</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>Our annotator will need only one type –
- <literal>org.apache.uima.tutorial.RoomNumber</literal>. (We use the same
- namespace conventions as are used for Java classes.) Just as in Java, types have
- supertypes. The supertype is listed in the second column of the left table. In this
- case our RoomNumber annotation extends from the built-in type
- <literal>uima.tcas.Annotation</literal>.</para>
-
- <para>Descriptions can be included with types and features. In this example, there is a
- description associated with the <literal>building</literal> feature. To see it,
- hover the mouse over the feature.</para>
-
- <para>The bottom tab labeled <quote>Source</quote> will show you the XML source file
- associated with this descriptor.</para>
-
- <para>The built-in Annotation type declares three fields (called
- <emphasis>Features</emphasis> in CAS terminology). The features <literal>begin</literal>
- and <literal>end</literal> store the character offsets of the span of text to which the
- annotation refers. The feature <literal>sofa</literal> (Subject of Analysis) indicates
- which document the begin and end offsets point into. The <literal>sofa</literal> feature
- can be ignored for now since we assume in this tutorial that the CAS contains only one
- subject of analysis (document).</para>
- <para>Our RoomNumber type will inherit these three features from
- <literal>uima.tcas.Annotation</literal>, its supertype; they are not visible in
- this view because inherited features are not shown. One additional feature,
- <literal>building</literal>, is declared. It takes a String as its value. Instead
- of String, we could have declared the range-type of our feature to be any other CAS type
- (defined or built-in).</para>
-
- <para>If you are not using Eclipse and need to edit the type system, you can do so directly with any XML
- or text editor. The following is the actual XML representation of the Type
- System displayed above in the editor:</para>
-
-
- <programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8" ?>
- <typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
- <name>TutorialTypeSystem</name>
- <description>Type System Definition for the tutorial examples -
- as of Exercise 1</description>
- <vendor>Apache Software Foundation</vendor>
- <version>1.0</version>
- <types>
- <typeDescription>
- <name>org.apache.uima.tutorial.RoomNumber</name>
- <description></description>
- <supertypeName>uima.tcas.Annotation</supertypeName>
- <features>
- <featureDescription>
- <name>building</name>
- <description>Building containing this room</description>
- <rangeTypeName>uima.cas.String</rangeTypeName>
- </featureDescription>
- </features>
- </typeDescription>
- </types>
- </typeSystemDescription>]]></programlisting>
-
- </section>
-
- <section id="ugr.tug.aae.generating_jcas_sources">
- <title>Generating Java Source Files for CAS Types</title>
-
- <para>When you save a descriptor that you have modified, the Component Descriptor
- Editor will automatically generate Java classes corresponding to the types that are
- defined in that descriptor (unless this has been disabled), using a utility called
- JCasGen. These Java classes will have the same name (including package) as the CAS
- types, and will have get and set methods for each of the features that you have
- defined.</para>
-
- <para>This feature is enabled/disabled using the UIMA menu pulldown (or the Eclipse
- Preferences → UIMA). If automatic running of JCasGen is not happening, please
- make sure the option is checked:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image004.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of enabling automatic running of JCasGen</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>The Java class for the example org.apache.uima.tutorial.RoomNumber type can
- be found in <literal>src/org/apache/uima/tutorial/RoomNumber.java</literal>.
- You will see how to use these generated classes in the next section.</para>
-
- <para>If you are not using the Component Descriptor Editor, you will need to generate
- these Java classes by using the <emphasis>JCasGen</emphasis> tool. JCasGen reads a
- Type System Descriptor XML file and generates the corresponding Java classes that
- you can then use in your annotator code. To launch JCasGen, run the jcasgen shell
- script located in the <literal>/bin</literal> directory of the UIMA SDK
- installation. This should launch a GUI that looks something like this:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image006.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of JCasGen</phrase></textobject>
- </mediaobject>
-</screenshot>
-
- <para>Use the <quote>Browse</quote> buttons to select your input file
- (TutorialTypeSystem.xml) and output directory (the root of the source tree into
- which you want the generated files placed). Then click the <quote>Go</quote>
- button. If the Type System Descriptor has no errors, new Java source files will be
- generated under the specified output directory.</para>
-
- <para>There are some additional options to choose from when running JCasGen; please
- refer to the <olink targetdoc="&uima_docs_tools;"
- targetptr="ugr.tools.jcasgen"/> for details.</para>
- </section>
-
- <section id="ugr.tug.aae.developing_annotator_code">
- <title>Developing Your Annotator Code</title>
-
- <para>Annotator implementations all implement a standard interface (AnalysisComponent), having several
- methods, the most important of which are:
-
- <itemizedlist spacing="compact">
- <listitem>
- <para><literal>initialize</literal>, </para>
- </listitem>
-
- <listitem>
- <para><literal>process</literal>, and </para>
- </listitem>
-
- <listitem>
- <para><literal>destroy</literal>. </para>
- </listitem>
- </itemizedlist></para>
-
- <para><literal>initialize</literal> is called by the framework once when it first creates an instance of the
- annotator class. <literal>process</literal> is called once per item being processed.
- <literal>destroy</literal> may be called by the application when it is done using your annotator. There is a
- default implementation of this interface for annotators using the JCas, called JCasAnnotator_ImplBase, which
- has implementations of all required methods except for the process method.</para>
-
- <para>Our annotator class extends the JCasAnnotator_ImplBase; most annotators that use the JCas will extend
- from this class, so they only have to implement the process method. This class is not restricted to handling
- just text; see <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>.</para>
-
- <para>Annotators are not required to extend from the JCasAnnotator_ImplBase class; they may instead
- directly implement the AnalysisComponent interface, and provide all method implementations themselves.
- <footnote>
- <para>Note that AnalysisComponent is not specific to JCas. There is a method getRequiredCasInterface()
- which the user would have to implement to return <literal>JCas.class</literal>. Then in the
- <literal>process(AbstractCas cas)</literal> method, they would need to typecast
- <literal>cas</literal> to type <literal>JCas</literal>.</para></footnote> This allows you to have
- your annotator inherit from some other superclass if necessary. If you would like to do this, see the Javadocs
- for JCasAnnotator for descriptions of the methods you must implement.</para>
-
- <para>Annotator classes need to be public, cannot be declared abstract, and must have public, 0-argument
- constructors, so that they can be instantiated by the framework.<footnote>
- <para>Although Java classes in which you do not define any constructor will, by default, have a 0-argument
- constructor that doesn't do anything, a class in which you have defined at least one constructor does
- not get a default 0-argument constructor.</para></footnote></para>
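The reason for these constraints can be seen in a small reflection sketch. It is not the framework's actual loading code; it only assumes that a class named in a descriptor is instantiated reflectively through its public 0-argument constructor. All class names below are hypothetical.

```java
// Hypothetical illustration of reflective instantiation; not UIMA framework code.
class InstantiationSketch {
    // Public, concrete, implicit public 0-argument constructor: can be instantiated.
    public static class GoodAnnotator { }

    // Declares only a 1-argument constructor, so no default constructor is generated.
    public static class NoDefaultConstructor {
        public NoDefaultConstructor(String config) { }
    }

    // Abstract classes have constructors but cannot be instantiated.
    public abstract static class AbstractAnnotator { }

    // Tries what a framework would plausibly do with a class name from a descriptor.
    static boolean canInstantiate(String className) {
        try {
            Class.forName(className).getConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // Covers a missing class, a missing public 0-argument constructor,
            // an abstract class, and a constructor that throws.
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(GoodAnnotator.class.getName()));
        System.out.println(canInstantiate(NoDefaultConstructor.class.getName()));
        System.out.println(canInstantiate(AbstractAnnotator.class.getName()));
    }
}
```

Only the first class satisfies all three conditions; the other two fail for the reasons given in the paragraph and its footnote.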
-
- <para>The class definition for our RoomNumberAnnotator implements the process method, and is shown here. You
- can find the source for this in
- <literal>uimaj-examples/src/org/apache/uima/tutorial/ex1/RoomNumberAnnotator.java</literal>.
- <note>
- <para>In Eclipse, in the <quote>Package Explorer</quote> view, this will appear by default in the project
- <literal>uimaj-examples</literal>, in the folder <literal>src</literal>, in the package
- <literal>org.apache.uima.tutorial.ex1</literal>.</para></note> In Eclipse, open the
- RoomNumberAnnotator.java in the uimaj-examples project, under the src directory.</para>
-
-
- <programlisting>package org.apache.uima.tutorial.ex1;
-
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
-import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
-import org.apache.uima.jcas.JCas;
-import org.apache.uima.tutorial.RoomNumber;
-
-/**
- * Example annotator that detects room numbers using
- * Java 1.4 regular expressions.
- */
-public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
- private Pattern mYorktownPattern =
- Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
-
- private Pattern mHawthornePattern =
- Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");
-
- public void process(JCas aJCas) {
- // Discussed Later
- }
-}</programlisting>
-
- <para>The two Java class fields, mYorktownPattern and mHawthornePattern, hold regular expressions that
- will be used in the process method. Note that these two fields are part of the Java implementation of the
- annotator code, and not a part of the CAS type system. We are using the regular expression facility that is
- built into Java 1.4. It is not critical that you know the details of how this works, but if you are curious the
- details can be found in the Java API docs for the java.util.regex package.</para>
-
- <para>The only method that we are required to implement is <literal>process</literal>. This method is typically
- called once for each document that is being analyzed. This method takes one argument, which is a JCas instance;
- this holds the document to be analyzed and all of the analysis results. <footnote>
- <para>Version 1 of UIMA specified an additional parameter, the ResultSpecification. This provides a
- specification of which types and features are desired to be computed and "output" from this annotator. Its
- use is optional; many annotators ignore it.</para>
- <para> This parameter has been replaced by specific set/getResultSpecification() methods, which allow
- the annotator to receive a signal (a method call) when the result specification changes.</para>
- </footnote></para>
-
-
- <programlisting>public void process(JCas aJCas) {
- // get document text
- String docText = aJCas.getDocumentText();
- // search for Yorktown room numbers
- Matcher matcher = mYorktownPattern.matcher(docText);
- int pos = 0;
- while (matcher.find(pos)) {
- // found one - create annotation
- RoomNumber annotation = new RoomNumber(aJCas);
- annotation.setBegin(matcher.start());
- annotation.setEnd(matcher.end());
- annotation.setBuilding("Yorktown");
- annotation.addToIndexes();
- pos = matcher.end();
- }
- // search for Hawthorne room numbers
- matcher = mHawthornePattern.matcher(docText);
- pos = 0;
- while (matcher.find(pos)) {
- // found one - create annotation
- RoomNumber annotation = new RoomNumber(aJCas);
- annotation.setBegin(matcher.start());
- annotation.setEnd(matcher.end());
- annotation.setBuilding("Hawthorne");
- annotation.addToIndexes();
- pos = matcher.end();
- }
-}</programlisting>
-
- <para>The Matcher class is part of the java.util.regex package and is used to find the room numbers in the
- document text. When we find one, recording the annotation is as simple as creating a new Java object and
- calling some set methods:</para>
-
-
- <programlisting>RoomNumber annotation = new RoomNumber(aJCas);
-annotation.setBegin(matcher.start());
-annotation.setEnd(matcher.end());
-annotation.setBuilding("Yorktown");</programlisting>
-
- <para>The <literal>RoomNumber</literal> class was generated from the type system description by the
- Component Descriptor Editor or the JCasGen tool, as discussed in the previous section.</para>
-
- <para>Finally, we call <literal>annotation.addToIndexes()</literal> to add the new annotation to the
- indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps
- an index of all annotations in their order from beginning to end of the document. Subsequent annotators or
- applications use the indexes to iterate over the annotations. </para>
-
- <note>
- <para>If you don't add the instance to the indexes, downstream annotators cannot retrieve it
- through the indexes.</para></note>
-
- <note>
- <para>You can also call <literal>addToIndexes()</literal> on Feature Structures that are not subtypes of
- <literal>uima.tcas.Annotation</literal>, but these will not be sorted in any particular way. If you want
- to specify a sort order, you can define your own custom indexes in the CAS: see <olink
- targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/> and <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.xml.component_descriptor.aes.index"/> for details.</para></note>
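To see why an ordered index matters, note that the process method adds all Yorktown matches before any Hawthorne matches, yet a later component iterating the annotation index sees them in document order. The sketch below models only that one property with a list sorted by begin offset; the class `IndexOrderSketch` is hypothetical, and the real CAS index applies further ordering rules that are not modelled here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of "annotations ordered from beginning to end of the document".
class IndexOrderSketch {
    static final class Ann {
        final int begin;
        final int end;
        final String building;

        Ann(int begin, int end, String building) {
            this.begin = begin;
            this.end = end;
            this.building = building;
        }
    }

    // Returns the building feature of each annotation in document order,
    // regardless of the order in which the annotations were added.
    static List<String> buildingsInDocumentOrder(List<Ann> added) {
        List<Ann> index = new ArrayList<>(added);
        index.sort(Comparator.comparingInt((Ann a) -> a.begin));
        List<String> result = new ArrayList<>();
        for (Ann a : index) {
            result.add(a.building);
        }
        return result;
    }

    public static void main(String[] args) {
        // Insertion order mirrors the annotator: Yorktown pass first, then Hawthorne.
        List<Ann> added = Arrays.asList(
                new Ann(30, 36, "Yorktown"),
                new Ann(50, 56, "Yorktown"),
                new Ann(10, 16, "Hawthorne"));
        System.out.println(buildingsInDocumentOrder(added));
    }
}
```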
-
- <para>We're almost ready to test the RoomNumberAnnotator. There is just one more step
- remaining.</para>
- </section>
- <section id="ugr.tug.aae.creating_xml_descriptor">
- <title>Creating the XML Descriptor</title>
-
- <para>The UIMA architecture requires that descriptive information about an
- annotator be represented in an XML file and provided along with the annotator class
- file(s) to the UIMA framework at run time. This XML file is called an
- <emphasis>Analysis Engine Descriptor</emphasis>. The descriptor includes:
-
- <itemizedlist><listitem><para>Name, description, version, and vendor</para>
- </listitem>
-
- <listitem><para>The annotator's inputs and outputs, defined in terms of
- the types in a Type System Descriptor</para></listitem>
-
- <listitem><para>Declaration of the configuration parameters that the
- annotator accepts </para></listitem></itemizedlist> </para>
-
- <para>The <emphasis>Component Descriptor Editor</emphasis> plugin, which we
- previously used to edit the Type System descriptor, can also be used to edit Analysis
- Engine Descriptors.</para>
-
- <para>A descriptor for our RoomNumberAnnotator is provided with the UIMA
- distribution under the name
- <literal>descriptors/tutorial/ex1/RoomNumberAnnotator.xml</literal>. To
- edit it in Eclipse, right-click on that file in the navigator and select Open With
- → Component Descriptor Editor.</para> <tip><para>In Eclipse, you can double
- click on the tab at the top of the Component Descriptor Editor's window
- identifying the currently selected editor, and the window will
- <quote>Maximize</quote>. Double click it again to restore the original size.</para>
- </tip>
-
- <para>If you are not using Eclipse, you will need to edit Analysis Engine descriptors
- manually. See <xref linkend="ugr.tug.aae.xml_intro_ae_descriptor"/> for an
- introduction to the Analysis Engine descriptor XML syntax. The remainder of this
- section assumes you are using the Component Descriptor Editor plug-in to edit the
- Analysis Engine descriptor.</para>
-
- <para>The Component Descriptor Editor consists of several tabbed pages; we will only
- need to use a few of them here. For more information on using this editor, see <olink
- targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>.</para>
-
- <para>The initial page of the Component Descriptor Editor is the Overview page, which
- appears as follows:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image008.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of Component Descriptor Editor overview page</phrase>
- </textobject>
- </mediaobject>
-</screenshot>
-
- <para>This presents an overview of the RoomNumberAnnotator Analysis Engine (AE). The
- left side of the page shows that this descriptor is for a
- <emphasis>Primitive</emphasis> AE (meaning it consists of a single annotator),
- and that the annotator code is developed in Java. Also, it specifies the Java class
- that implements our logic (the code which was discussed in the previous section).
- Finally, on the right side of the page are listed some descriptive attributes of our
- annotator.</para>
-
- <para>The other two pages that need to be filled out are the Type System page and the
- Capabilities page. You can switch to these pages using the tabs at the bottom of the
- Component Descriptor Editor. In the tutorial, these are already filled out for
- you.</para>
-
- <para>The RoomNumberAnnotator will be using the TutorialTypeSystem we looked at in
- Section <xref linkend="ugr.tug.aae.defining_types"/>. To specify this, we add
- this type system to the Analysis Engine's list of Imported Type Systems, using
- the Type System page's right side panel, as shown here:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image010.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of CDE Type System page</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>On the Capabilities page, we define our annotator's inputs and outputs, in
- terms of the types in the type system. The Capabilities page is shown below:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.3in" format="JPG" fileref="&imgroot;image012.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of CDE Capabilities page</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>Although capabilities come in sets, having multiple sets is deprecated; here
- we're just using one set. The RoomNumberAnnotator is very simple. It requires
- no input types, as it operates directly on the document text, which is supplied as
- part of the CAS initialization (and which is always assumed to be present). It
- produces only one output type (RoomNumber), and it sets the value of the
- <literal>building</literal> feature on that type. This is all represented on the
- Capabilities page.</para>
-
- <para>The Capabilities page has two other parts for specifying languages and Sofas.
- The languages section allows you to specify which languages your Analysis Engine
- supports. The RoomNumberAnnotator happens to be language-independent, so we can
- leave this blank. The Sofas section allows you to specify the names of additional
- subjects of analysis. This capability and the Sofa Mappings at the bottom are
- advanced topics, described in <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.aas"/>. </para>
-
- <para>This is all of the information we need to provide for a simple annotator. If you
- want to peek at the XML that this tool saves you from having to write, click on the
- <quote>Source</quote> tab at the bottom to view the generated XML.</para>
- </section>
-
- <section id="ugr.tug.aae.testing_your_annotator">
- <title>Testing Your Annotator</title>
-
- <para>Having developed an annotator, we need a way to try it out on some example
- documents. The UIMA SDK includes a tool called the Document Analyzer that will allow
- us to do this. To run the Document Analyzer, execute the documentAnalyzer shell
- script that is in the <literal>bin</literal> directory of your UIMA SDK
- installation, or, if you are using the example Eclipse project, execute the
- <quote>UIMA Document Analyzer</quote> run configuration supplied with that
- project. (To do this, click on the menu bar Run → Run ... → and under Java
- Applications in the left box, click on UIMA Document Analyzer.)</para>
-
- <para>You should see a screen that looks like this:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image014.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of UIMA Document Analyzer GUI</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>There are six options on this screen:</para>
-
- <orderedlist><listitem><para>Directory containing documents to analyze</para>
- </listitem>
-
- <listitem><para>Directory where analysis results will be written</para>
- </listitem>
-
- <listitem><para>The XML descriptor for the Analysis Engine (AE) you want to
- run</para></listitem>
-
- <listitem><para>(Optional) an XML tag, within the input documents, that contains
- the text to be analyzed. For example, the value TEXT would cause the AE to only
- analyze the portion of the document enclosed within
- <TEXT>...</TEXT> tags.</para></listitem>
-
- <listitem><para>Language of the document </para></listitem>
-
- <listitem><para>Character encoding </para></listitem></orderedlist>
-
- <para>Use the Browse button next to the third item to set the <quote>Location of AE XML
- Descriptor</quote> field to the descriptor we've just been discussing
- —
- <literal><where-you-installed-uima-e.g.UIMA_HOME>
- /examples/descriptors/tutorial/ex1/RoomNumberAnnotator.xml</literal>
- . Set the other fields to the values shown in the screen shot above (which should be the
- default values if this is the first time you've run the Document Analyzer). Then
- click the <quote>Run</quote> button to start processing.</para>
-
- <para>When processing completes, an <quote>Analysis Results</quote> window should
- appear.</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="3.5in" format="JPG" fileref="&imgroot;image016.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of UIMA Document Analyzer Results GUI</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>Make sure <quote>Java Viewer</quote> is selected as the Results Display
- Format, and <emphasis role="bold">double-click</emphasis> on the document
- UIMASummerSchool2003.txt to view the annotations that were discovered. The view
- should look something like this:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image018.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of UIMA CAS Annotation Viewer GUI</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>You can click the mouse on one of the highlighted annotations to see a list of all
- its features in the frame on the right.</para> <note><para>The legend will only show
- those types which have at least one instance in the CAS, and are declared as outputs in the
- capabilities section of the descriptor (see <xref
- linkend="ugr.tug.aae.creating_xml_descriptor"/>). </para></note>
-
- <para>You can use the DocumentAnalyzer to test any UIMA annotator
- — just make sure that the annotator's classes are in the class
- path.</para>
- </section>
- </section>
-
- <section id="ugr.tug.aae.configuration_logging">
- <title>Configuration and Logging</title>
-
- <section id="ugr.tug.aae.configuration_parameters">
- <title>Configuration Parameters</title>
-
- <para>The example RoomNumberAnnotator from the previous section used hardcoded
- regular expressions and location names, which is not very flexible. Supporting a
- new room-numbering convention would mean editing and recompiling the
- annotator's Java code. A better solution is to supply the room number patterns
- and location names through configuration parameters, so that they can be changed
- without touching the code.</para>
-
- <para>UIMA allows annotators to declare configuration parameters in their
- descriptors. The descriptor also specifies default values for the parameters,
- though these can be overridden at runtime.</para>
-
- <section id="ugr.tug.aae.declaring_parameters_in_the_descriptor">
- <title>Declaring Parameters in the Descriptor</title>
-
- <para>The example descriptor
- <literal>descriptors/tutorial/ex2/RoomNumberAnnotator.xml</literal> is
- the same as the descriptor from the previous section except that information has
- been filled in for the Parameters and Parameter Settings pages of the Component
- Descriptor Editor.</para>
-
- <para>First, in Eclipse, open example two's RoomNumberAnnotator in the
- Component Descriptor Editor, and then go to the Parameters page (click on the
- parameters tab at the bottom of the window), which is shown below:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image020.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of UIMA Component Descriptor Editor (CDE) Parameters page</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>Two parameters, Patterns and Locations, have been declared. In this
- screen shot, the mouse (not shown) is hovering over Patterns to show its
- description in the small popup window. Every parameter has the following
- information associated with it:</para>
-
- <itemizedlist><listitem><para>name – the name by which the annotator code
- refers to the parameter</para></listitem>
-
- <listitem><para>description – a natural language description of the
- intent of the parameter</para></listitem>
-
- <listitem><para>type – the data type of the parameter's value
- – must be one of String, Integer, Float, or Boolean.</para></listitem>
-
- <listitem><para>multiValued – true if the parameter can take
- multiple values (an array), false if the parameter takes only a single value.
- Shown above as <literal>Multi</literal>.</para></listitem>
-
- <listitem><para>mandatory – true if a value must be provided for the
- parameter. Shown above as <literal>Req</literal> (for required). </para>
- </listitem></itemizedlist>
-
- <para>Both of our parameters are mandatory and accept an array of Strings as their
- value.</para>
-
- <para>Next, default values are assigned to the parameters on the Parameter Settings
- page:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image022.jpg"/>
- </imageobject>
- <textobject><phrase>Screenshot of UIMA Component Descriptor Editor (CDE) Parameter Settings page</phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>Here the <quote>Patterns</quote> parameter is selected, and the right pane
- shows the list of values for this parameter, in this case the regular expressions
- that match particular room numbering conventions. Notice the third pattern is
- new, for matching the style of room numbers in the third building, which has room
- numbers such as <literal>J2-A11</literal>.</para>
- </section>
- <section id="ugr.tug.aae.accessing_parameter_values_from_annotator">
- <title>Accessing Parameter Values from the Annotator Code</title>
-
- <para>The class
- <literal>org.apache.uima.tutorial.ex2.RoomNumberAnnotator</literal> has
- overridden the initialize method. The initialize method is called by the UIMA
- framework when the annotator is instantiated, so it is a good place to read
- configuration parameter values. The default initialize method does nothing with
- configuration parameters, so you have to override it. To see the code in Eclipse,
- switch to the src folder, and open
- <literal>org.apache.uima.tutorial.ex2</literal>. Here is the method
- body:</para>
-
-
- <programlisting>/**
-* @see AnalysisComponent#initialize(UimaContext)
-*/
-public void initialize(UimaContext aContext)
- throws ResourceInitializationException {
- super.initialize(aContext);
-
- // Get config. parameter values
- String[] patternStrings =
- (String[]) aContext.getConfigParameterValue("Patterns");
- mLocations =
- (String[]) aContext.getConfigParameterValue("Locations");
-
- // compile regular expressions
- mPatterns = new Pattern[patternStrings.length];
- for (int i = 0; i < patternStrings.length; i++) {
- mPatterns[i] = Pattern.compile(patternStrings[i]);
- }
-}</programlisting>
-
- <para>Configuration parameter values are accessed through the UimaContext. As you
- will see in subsequent sections of this chapter, the UimaContext is the
- annotator's access point for all of the facilities provided by the UIMA
- framework – for example logging and external resource access.</para>
-
- <para>The UimaContext's <literal>getConfigParameterValue</literal>
- method takes the name of the parameter as an argument; this must match one of the
- parameters declared in the descriptor. The return value of this method is a Java
- Object, whose type corresponds to the declared type of the parameter. It is up to the
- annotator to cast it to the appropriate type, String[] in this case.</para>
-
- <para>If there is a problem retrieving the parameter values, the framework throws an
- exception. Generally annotators don't handle these, and just let them
- propagate up.</para>
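-
- <para>As a rough illustration of the pattern-compilation step, the following
- stand-alone sketch uses only the JDK, so it can be tried without the UIMA
- framework. The class name <literal>PatternParameterSketch</literal>, its two
- methods, and the regular expressions in the usage note are assumptions made for
- this example; they are not taken from the tutorial sources or from the shipped
- descriptor.</para>
-
- <programlisting>import java.util.ArrayList;
-import java.util.List;
-import java.util.regex.Matcher;
-import java.util.regex.Pattern;
-
-// Hypothetical helper class, not part of the UIMA SDK or the tutorial code.
-class PatternParameterSketch {
-
-  // Mirrors the loop in initialize(): one compiled Pattern per configured string.
-  static Pattern[] compileAll(String[] patternStrings) {
-    Pattern[] patterns = new Pattern[patternStrings.length];
-    for (int i = 0; i < patternStrings.length; i++) {
-      patterns[i] = Pattern.compile(patternStrings[i]);
-    }
-    return patterns;
-  }
-
-  // Collects every match of every pattern, pattern by pattern.
-  static List<String> findAll(Pattern[] patterns, String text) {
-    List<String> found = new ArrayList<String>();
-    for (Pattern pattern : patterns) {
-      Matcher matcher = pattern.matcher(text);
-      while (matcher.find()) {
-        found.add(matcher.group());
-      }
-    }
-    return found;
-  }
-}</programlisting>
-
- <para>Compiling the patterns once, rather than on every call, is the reason the
- tutorial annotator does this work in <literal>initialize</literal>. With the
- assumed patterns <literal>[0-9]{2}-[0-9]{3}</literal> and
- <literal>J[0-9]-[A-Z][0-9]{2}</literal>, calling <literal>findAll</literal> on
- the text <quote>Meet in J2-A11 or 20-043.</quote> yields the matches 20-043 and
- J2-A11, in pattern order.</para>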
-
- <para>To see the configuration parameters working, run the Document Analyzer
- application and select the descriptor
- <literal>examples/descriptors/tutorial/ex2/RoomNumberAnnotator.xml</literal>
- . In the example document <literal>WatsonConferenceRooms.txt</literal>, you
- should see some examples of Hawthorne II room numbers that would not have been
- detected by the ex1 version of RoomNumberAnnotator.</para>
- </section>
-
- <section id="ugr.tug.aae.supporting_reconfiguration">
- <title>Supporting Reconfiguration</title>
-
- <para>If you take a look at the Javadocs (located in the <ulink
- url="api/index.html">docs/api</ulink> directory) for
- <literal>org.apache.uima.analysis_component.AnalysisComponent</literal>
- (which our annotator implements indirectly through JCasAnnotator_ImplBase),
- you will see that there is a reconfigure() method, which is called by the containing
- application through the UIMA framework, if the configuration parameter values
- are changed.</para>
-
- <para>The AnalysisComponent_ImplBase class provides a default implementation
- that just calls the annotator's destroy method followed by its initialize
- method. This works fine for our annotator. The only situation in which you might
- want to override the default reconfigure() is if your annotator has very expensive
- initialization logic, and you don't want to reinitialize everything if just
- one configuration parameter has changed. In that case, you can provide a more
- intelligent implementation of reconfigure() for your annotator.</para>
-
- </section>
-
- <section id="ugr.tug.aae.configuration_parameter_groups">
- <title>Configuration Parameter Groups</title>
-
- <para>For annotators with many sets of configuration parameters, UIMA supports
- organizing them into groups. It is possible to define a parameter with the same name
- in multiple groups; one common use for this is for annotators that can process
- documents in several languages and which want to have different parameter
- settings for the different languages.</para>
-
- <para>The syntax for defining parameter groups in your descriptor is fairly
- straightforward – see <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.xml.component_descriptor"/> for details. Values of
- parameters defined within groups are accessed through the two-argument version
- of <literal>UimaContext.getConfigParameterValue</literal>, which takes
- both the group name and the parameter name as its arguments.</para>
- </section>
- </section>
-
- <section id="ugr.tug.aae.logging">
- <title>Logging</title>
-
- <para>The UIMA SDK provides a logging facility, which is very similar to the
- java.util.logging.Logger class that was introduced in Java 1.4.</para>
-
- <para>In the Java architecture, each logger instance is associated with a name. By
- convention, this name is often the fully qualified class name of the component
- issuing the logging call. The name can be referenced in a configuration file when
- specifying which kinds of log messages to actually log, and where they should
- go.</para>
-
- <para>The UIMA framework supports this convention using the
- <literal>UimaContext</literal> object. If you access a logger instance using
- <literal>getContext().getLogger()</literal> within an Annotator, the logger
- name will be the fully qualified name of the Annotator implementation class.</para>
-
- <para>Here is an example from the process method of
- <literal>org.apache.uima.tutorial.ex2.RoomNumberAnnotator</literal>:
-
-
- <programlisting>getContext().getLogger().log(Level.FINEST,"Found: " + annotation);</programlisting>
- </para>
-
- <para>The first argument to the log method is the level of the log output. Here, a value of
- FINEST indicates that this is a highly-detailed tracing message. While useful for
- debugging, it is likely that real applications will not output log messages at this
- level, in order to improve their performance. Other defined levels, from lowest to
- highest importance, are FINER, FINE, CONFIG, INFO, WARNING, and SEVERE.</para>
-
- <para>If no logging configuration file is provided (see next section), the Java
- Virtual Machine defaults are used, which typically log messages at level INFO and
- higher, and direct output to the console.</para>
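-
- <para>Because FINEST messages are normally filtered out, it can be worth checking
- the level before building an expensive message. The sketch below illustrates the
- idea with the JDK's <literal>java.util.logging</literal> API directly, not the
- UIMA Logger; the class name <literal>LevelFilterSketch</literal> and the logger
- name are made up for this illustration.</para>
-
- <programlisting>import java.util.logging.Level;
-import java.util.logging.Logger;
-
-// Hypothetical demo class; it uses java.util.logging directly.
-class LevelFilterSketch {
-
-  // Reports whether a logger set to INFO would publish FINEST, INFO and SEVERE.
-  static boolean[] loggableAtInfo() {
-    Logger logger = Logger.getLogger("org.example.LevelFilterSketch");
-    logger.setLevel(Level.INFO);
-    return new boolean[] {
-      logger.isLoggable(Level.FINEST),
-      logger.isLoggable(Level.INFO),
-      logger.isLoggable(Level.SEVERE)
-    };
-  }
-
-  // Builds the (possibly costly) message only if FINEST would be published.
-  static void traceFound(Logger logger, Object annotation) {
-    if (logger.isLoggable(Level.FINEST)) {
-      logger.log(Level.FINEST, "Found: " + annotation);
-    }
-  }
-}</programlisting>
-
- <para>With the level set to INFO, <literal>loggableAtInfo</literal> reports that
- FINEST is filtered out while INFO and SEVERE are published, so
- <literal>traceFound</literal> skips the string concatenation entirely.</para>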
-
- <para>If you specify the standard UIMA SDK <literal>Logger.properties</literal>,
- the output will be directed to a file named uima.log, in the current working directory
- (often the <quote>project</quote> directory when running from Eclipse, for
- instance).</para> <note><para>When using Eclipse, the uima.log file, if written
- into the Eclipse workspace in the project uimaj-examples, for example, may not appear
- in the Eclipse package explorer view until you right-click the uimaj-examples project
- with the mouse, and select <quote>Refresh</quote>. This operation refreshes the
- Eclipse display to conform to what may have changed on the file system. Also, you can set
- the Eclipse preferences for the workspace to automatically refresh (Window →
- Preferences → General → Workspace, then click the <quote>refresh
- automatically</quote> checkbox).</para></note>
-
- <section id="ugr.tug.aae.logging.configuring">
- <title>Specifying the Logging Configuration</title>
-
- <para>The standard UIMA logger uses the underlying Java 1.4 logging mechanism. You
- can use the APIs that come with that to configure the logging. In addition, the
- standard Java 1.4 logging initialization mechanisms will look for a Java System
- Property named <literal>java.util.logging.config.file</literal> and if
- found, will use the value of this property as the name of a standard
- <quote>properties</quote> file, for setting the logging level. Please refer to
- the Java 1.4 documentation for more information on the format and use of this
- file.</para>
-
- <para>Two sample logging specification property files can be found in the UIMA_HOME
- directory where the UIMA SDK is installed:
- <literal>config/Logger.properties</literal>, and
- <literal>config/FileConsoleLogger.properties</literal>. These specify the same
- logging, except the first logs just to a file, while the second logs both to a file and
- to the console. You can edit these files, or create additional ones, as described
- below, to change the logging behavior.</para>
-
- <para>When running your own Java application, you can specify the location of the
- logging configuration file on your Java command line by setting the Java system
- property <literal>java.util.logging.config.file</literal> to be the logging
- configuration filename. This file specification can be either absolute or
- relative to the working directory. For example:
-
-
- <programlisting><?db-font-size 65% ?>java "-Djava.util.logging.config.file=C:/Program Files/apache-uima/config/Logger.properties"</programlisting>
- <note><para>In a shell script, you can use environment variables such as
- UIMA_HOME if convenient.</para></note> </para>
-
- <para>If you are using Eclipse to launch your application, you can set this property
- in the VM arguments section of the Arguments tab of the run configuration screen. If
- you've set an environment variable UIMA_HOME, you could for example, use the
- string:
- <literal>"-Djava.util.logging.config.file=${env_var:UIMA_HOME}/config/Logger.properties".</literal>
- </para>
-
- <para>If you are running the .bat or .sh files in the UIMA SDK's <literal>bin</literal> directory, you can specify the location of your
- logger configuration file by setting the <literal>UIMA_LOGGER_CONFIG_FILE</literal> environment variable prior to running the script,
- for example (on Windows):
-
- <programlisting><?db-font-size 70% ?>set UIMA_LOGGER_CONFIG_FILE=C:/myapp/MyLogger.properties</programlisting>
- </para>
- </section>
-
- <section id="ugr.tug.aae.logging.setting_logging_levels">
- <title>Setting Logging Levels</title>
-
- <para>Within the logging control file, the default global logging level specifies
- which kinds of events are logged across all loggers, for example:
- <literal>.level= INFO</literal>. For any given facility this global level can be
- overridden by a facility-specific level. Multiple handlers are supported, which
- allows messages to be directed to a log file as well as to a
- <quote>console</quote>. Note that the ConsoleHandler also has a separate level
- setting to limit messages printed to the console.</para>
-
- <para>The properties file can change where the log is written, as well.</para>
-
- <para>Facility-specific properties allow a different logging level for each class.
- For example, to set the com.xyz.foo logger to only log SEVERE messages:
- <literal>com.xyz.foo.level = SEVERE</literal></para>
-
- <para>If you have a sample annotator in the package
- <literal>org.apache.uima.SampleAnnotator</literal> you can set the log level
- by specifying: <literal>org.apache.uima.SampleAnnotator.level =
- ALL</literal></para>
-
- <para>There are other logging controls; for a full discussion, please read the
- contents of the <literal>Logger.properties</literal> file and the Java
- specification for logging in Java 1.4.</para>
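-
- <para>To see how a global level and a facility-specific level interact, the
- following sketch loads two such properties from an in-memory string instead of a
- file. It uses only the JDK's <literal>java.util.logging</literal> API; the class
- name <literal>LoggingLevelsSketch</literal> is hypothetical, and loading from a
- string is a shortcut chosen for the example, not how the UIMA scripts supply the
- configuration.</para>
-
- <programlisting>import java.io.ByteArrayInputStream;
-import java.io.IOException;
-import java.nio.charset.StandardCharsets;
-import java.util.logging.LogManager;
-import java.util.logging.Logger;
-
-// Hypothetical demo class; a real setup would point the
-// java.util.logging.config.file system property at a properties file.
-class LoggingLevelsSketch {
-
-  // Applies a global INFO level plus a SEVERE override for com.xyz.foo,
-  // then returns the com.xyz.foo logger.
-  static Logger configure() throws IOException {
-    String props =
-        ".level = INFO\n"
-      + "com.xyz.foo.level = SEVERE\n";
-    LogManager.getLogManager().readConfiguration(
-        new ByteArrayInputStream(props.getBytes(StandardCharsets.UTF_8)));
-    return Logger.getLogger("com.xyz.foo");
-  }
-}</programlisting>
-
- <para>After <literal>configure()</literal> runs, the
- <literal>com.xyz.foo</literal> logger publishes only SEVERE messages, while a
- logger with no entry of its own inherits the global INFO level.</para>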
- </section>
-
- <section id="ugr.tug.aae.logging.output_format">
- <title>Format of logging output</title>
-
- <para>The logging output is formatted by handlers specified in the properties file
- for configuring logging, described above. The default formatter that comes with
- the UIMA SDK formats logging output as follows:</para>
-
- <para><literal>Timestamp - threadID: sourceInfo: Message level:
- message</literal></para>
-
- <para> Here's an example:</para>
-
- <para><literal>7/12/04 2:15:35 PM - 10:
- org.apache.uima.util.TestClass.main(62): INFO: You are not logged
- in!</literal></para>
- </section>
-
- <section id="ugr.tug.aae.logging.meaning_of_severity_levels">
- <title>Meaning of the logging severity levels</title>
-
- <para>These levels are defined by the Java logging framework, which was
- incorporated into Java as of the 1.4 release level. The levels are defined in the
- Javadocs for java.util.logging.Level, and include both logging and tracing
- levels:
- <itemizedlist spacing="compact">
- <listitem><para>OFF is a special level that can be used to turn off
- logging.</para></listitem>
-
- <listitem><para>ALL indicates that all messages should be logged. </para>
- </listitem>
-
- <listitem><para>CONFIG is a message level for configuration messages. These
- would typically occur once (during configuration) in methods like
- <literal>initialize()</literal>. </para></listitem>
-
- <listitem><para>INFO is a message level for informational messages, for
- example, connected to server IP: 192.168.120.12 </para></listitem>
-
- <listitem><para>WARNING is a message level indicating a potential
- problem.</para></listitem>
-
- <listitem><para>SEVERE is a message level indicating a serious
- failure.</para></listitem>
- </itemizedlist></para>
-
- <para> Tracing levels, typically used for debugging:
- <itemizedlist>
-
- <listitem><para>FINE is a message level providing tracing information,
- typically at a collection level (messages occurring once per collection).
- </para></listitem>
-
- <listitem><para>FINER indicates a fairly detailed tracing message,
- typically at a document level (once per document).</para></listitem>
-
- <listitem><para>FINEST indicates a highly detailed tracing message. </para>
- </listitem></itemizedlist></para>
- </section>
-
- <section id="ugr.tug.aae.logging.using_outside_of_an_annotator">
- <title>Using the logger outside of an annotator</title>
-
- <para>An application using UIMA may want to log its messages using the same logging
- framework. This can be done by getting a reference to the UIMA logger, as follows:
-
-
- <programlisting>Logger logger = UIMAFramework.getLogger(TestClass.class);</programlisting>
- </para>
-
- <para>The optional class argument allows filtering by class (if the log handler
- supports this). If not specified, the name of the returned logger instance is
- <quote>org.apache.uima</quote>.</para>
- </section>
-
- <section id="ugr.tug.aae.logging.change_logger_implementation">
- <title>Changing the underlying UIMA logging implementation</title>
-
- <para>By default, the UIMA framework uses the Java logging framework, under the hood of the UIMA Logger interface,
- to do logging. It is possible to change the logging implementation that UIMA uses from Java logging to
- an arbitrary logging system by specifying the system property
- <programlisting>-Dorg.apache.uima.logger.class=<loggerClass></programlisting>
- when the UIMA framework is started.
- </para>
- <para>
- The specified logger class must be available in the classpath and must implement the
- <code>org.apache.uima.util.Logger</code> interface.
- </para>
-
- <para>
- UIMA also provides a logging implementation that uses Apache Log4j instead of Java logging. To
- use Log4j, you have to provide the Log4j jars in the classpath, and your application
- must specify the logging configuration as shown below.
- <programlisting><?db-font-size 80% ?>-Dorg.apache.uima.logger.class=org.apache.uima.util.impl.Log4jLogger_impl</programlisting>
- </para>
- </section>
-
-
- </section>
- </section>
- <section id="ugr.tug.aae.building_aggregates">
- <title>Building Aggregate Analysis Engines</title>
-
- <section id="ugr.tug.aae.combining_annotators">
- <title>Combining Annotators</title>
-
- <para>The UIMA SDK makes it very easy to combine any sequence of Analysis Engines to
- form an <emphasis>Aggregate Analysis Engine</emphasis>. This is done through an
- XML descriptor; no Java code is required!</para>
-
- <para>If you go to the <literal>examples/descriptors/tutorial/ex3</literal>
- folder (in Eclipse, it's in your uimaj-examples project, under the
- <literal>descriptors/tutorial/ex3</literal> folder), you will find a
- descriptor for a TutorialDateTime annotator. This annotator detects dates and
- times (and also sentences and words). To see what this annotator can do, try it out
- using the Document Analyzer. If you are curious as to how this annotator works, the
- source code is included, but it is not necessary to understand the code at this
- time.</para>
-
- <para>We are going to combine the TutorialDateTime annotator with the
- RoomNumberAnnotator to create an aggregate Analysis Engine. This is illustrated
- in the following figure:
-
- <figure id="ugr.tug.aae.fig.combining_annotators">
- <title>Combining Annotators to form an Aggregate Analysis Engine</title>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="PNG"
- fileref="&imgroot;image024.png"/>
- </imageobject>
- <textobject> <phrase>Combining Annotators to form an Aggregate Analysis
- Engine</phrase>
- </textobject>
- </mediaobject>
- </figure> </para>
-
- <para>The descriptor that does this is named
- <literal>RoomNumberAndDateTime.xml</literal>, which you can open in the
- Component Descriptor Editor plug-in. This is in the uimaj-examples project in the
- folder <literal>descriptors/tutorial/ex3</literal>. </para>
-
- <para>The <quote>Aggregate</quote> page of the Component Descriptor Editor is
- used to define which components make up the aggregate. A screen shot is shown below.
- (If you are not using Eclipse, see <xref
- linkend="ugr.tug.aae.xml_intro_ae_descriptor"/> for the actual XML syntax
- for Aggregate Analysis Engine Descriptors.)</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image026.jpg"/>
- </imageobject>
- <textobject>
- <phrase>Aggregate page of the Component Descriptor Editor (CDE)</phrase>
- </textobject>
- </mediaobject>
-</screenshot>
-
- <para>On the left side of the screen is the list of component engines that make up the
- aggregate – in this case, the TutorialDateTime annotator and the
- RoomNumberAnnotator. To add a component, you can click the <quote>Add</quote>
- button and browse to its descriptor. You can also click the <quote>Find AE</quote>
- button and search for an Analysis Engine in your Eclipse workspace.
- <note><para>The <quote>AddRemote</quote> button is used for adding components
- which run remotely (for example, on another machine using a remote networking
- connection). This capability is described in section <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application.how_to_call_a_uima_service"/>.</para>
- </note> </para>
-
- <para>The order of the components in the left pane does not imply an order of
- execution. The order of execution, or <quote>flow</quote> is determined in the
- <quote>Component Engine Flow</quote> section on the right. UIMA supports
- different types of algorithms (including user-definable) for determining the
- flow. Here we pick the simplest: <literal>FixedFlow</literal>. We have chosen to
- have the RoomNumberAnnotator execute first, although in this case it
- doesn't really matter, since the RoomNumber and DateTime annotators do not
- have any dependencies on one another.</para>
-
- <para>If you look at the <quote>Type System</quote> page of the Component
- Descriptor Editor, you will see that it displays the type system but is not
- editable. The Type System of an Aggregate Analysis Engine is automatically
- computed by merging the Type Systems of all of its components.</para>
-
- <warning><para>If the components have different definitions for the same type name,
- the Component Descriptor Editor will show a warning. It is possible to continue past
- this warning, in which case your aggregate's type system will have the correct
- <quote>merged</quote>
- type definition that contains all of the features defined on that type by all of your
- components. However, it is not recommended to use this feature in conjunction with JCas,
- since the JCas Java class definitions cannot be so easily merged. See
- <olink
- targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.jcas.merging_types_from_other_specs"/> for more information.
- </para></warning>
-
- <para>The Capabilities page is where you explicitly declare the aggregate Analysis
- Engine's inputs and outputs. Sofas and Languages are described later.
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image028.jpg"/>
- </imageobject>
- <textobject><phrase>Screen shot of the Capabilities page of the Component Descriptor Editor
- </phrase></textobject>
- </mediaobject>
- </screenshot>
- </para>
- <para>Note that it is not automatically assumed that all outputs of each component
- Analysis Engine (AE) are passed through as outputs of the aggregate AE. In this
- case, for example, we have decided to suppress the Word and Sentence annotations
- that are produced by the TutorialDateTime annotator.</para>
-
- <para>You can run this AE using the Document Analyzer in the same way that you run any
- other AE. Just select the <literal>examples/descriptors/tutorial/ex3/
- RoomNumberAndDateTime.xml</literal> descriptor and click the Run button. You
- should see that RoomNumbers, Dates, and Times are all shown but that Words and
- Sentences are not:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image030.jpg"/>
- </imageobject>
- <textobject><phrase>Screen shot results of running the Document Analyzer
- </phrase></textobject>
- </mediaobject>
- </screenshot>
-
- </section>
-
- <section id="ugr.tug.aae.aaes_can_contain_cas_consumers">
- <title>AEs can also contain CAS Consumers</title>
-
- <para>In addition to aggregating Analysis Engines, Aggregates can also contain CAS
- Consumers (see <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.cpe"/>), or even a mixture of these components with regular
- Analysis Engines. The UIMA examples include an Aggregate that contains
- both an analysis engine and a CAS consumer, in
- <literal>examples/descriptors/MixedAggregate.xml</literal>.</para>
-
- <para>Analysis Engines support the <literal>collectionProcessComplete</literal>
- method, which is particularly important for many CAS Consumers. If
- an application (or a Collection Processing Engine) calls
- <literal>collectionProcessComplete</literal> on an aggregate, the framework
- will deliver that call to all of the components of the aggregate. If you use
- one of the built-in flow types (fixedFlow or capabilityLanguageFlow), then the
- order specified in that flow will be the same order in which the
- <literal>collectionProcessComplete</literal> calls are made to the components.
- If a custom flow is used, then the calls will be made in arbitrary order.
- </para>
- </section>
-
- <section id="ugr.tug.aae.reading_results_previous_annotators">
- <title>Reading the Results of Previous Annotators</title>
-
- <para>So far, we have been looking at annotators that look directly at the document text. However, annotators
- can also use the results of other annotators. One useful thing we can do at this point is look for the
- co-occurrence of a Date, a RoomNumber, and two Times – and annotate that as a Meeting.</para>
-
- <para>The CAS maintains <emphasis>indexes</emphasis> of annotations, and from an index you can obtain an
- iterator that allows you to step through all annotations of a particular type. Here's some example code
- that would iterate over all of the TimeAnnot annotations in the JCas:
-
-
- <programlisting>FSIndex timeIndex = aJCas.getAnnotationIndex(TimeAnnot.type);
-Iterator timeIter = timeIndex.iterator();
-while (timeIter.hasNext()) {
- TimeAnnot time = (TimeAnnot)timeIter.next();
-
- //do something
-}</programlisting></para>
-
- <note>
- <para>You can also use the method
- <literal>aJCas.getJFSIndexRepository().getAllIndexedFS(YourClass.type)</literal>, which returns an iterator
- over all instances of <literal>YourClass</literal> in no particular order. This can be useful for types
- that are not subtypes of the built-in Annotation type and which therefore have no default sort order.</para>
-
- <para>Also, if you've defined your own custom index as described in <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.xml.component_descriptor.aes.index"/>, you can get an iterator over that
- specific index by calling <literal>aJCas.getJFSIndexRepository().getIndex(label)</literal>.
- The <literal>getIndex(...)</literal> method also has a two-argument form; the second argument,
- if used, specializes the index to a subtype of the type the index was declared to index. For instance,
- if you defined an index called "allEvents" over the type <literal>Event</literal>, and wanted
- to get an index over just a particular subtype of event, say, <literal>TimeEvent</literal>,
- you can ask for that index using
- <literal>aJCas.getJFSIndexRepository().getIndex("allEvents", TimeEvent.type)</literal>.</para></note>
-
- <para>Now that we've explained the basics, let's take a look at the process method for
- <literal>org.apache.uima.tutorial.ex4.MeetingAnnotator</literal>. Since we're looking for a
- combination of a RoomNumber, a Date, and two Times, there are four nested iterators. (There's surely a
- better algorithm for doing this, but to keep things simple we're just going to look at every combination
- of the four items.)</para>
-
- <para>For each combination of the four annotations, we compute the span of text that includes all of them, and
- then we check to see if that span is smaller than a <quote>window</quote> size, a configuration parameter.
- There are also some checks to make sure that we don't annotate the same span of text multiple times. If all
- the checks pass, we create a Meeting annotation over the whole span. There's really nothing to
- it!</para>
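-
- <para>The window check described above can be sketched in plain Java. This is only an
- illustration, not the MeetingAnnotator source: the <literal>Span</literal> class is a hypothetical
- stand-in for the UIMA annotation types, and the names <literal>coveringSpan</literal> and
- <literal>fitsWindow</literal> are assumptions made for this example.</para>

```java
import java.util.Arrays;
import java.util.List;

public class MeetingWindowSketch {

  /** Hypothetical stand-in for an annotation: just begin and end offsets. */
  static final class Span {
    final int begin;
    final int end;

    Span(int begin, int end) {
      this.begin = begin;
      this.end = end;
    }
  }

  /** Returns the smallest span that covers all of the given spans. */
  static Span coveringSpan(List<Span> spans) {
    int begin = Integer.MAX_VALUE;
    int end = Integer.MIN_VALUE;
    for (Span s : spans) {
      begin = Math.min(begin, s.begin);
      end = Math.max(end, s.end);
    }
    return new Span(begin, end);
  }

  /** True if the covering span is smaller than the configured window size. */
  static boolean fitsWindow(List<Span> spans, int windowSize) {
    Span covering = coveringSpan(spans);
    return (covering.end - covering.begin) < windowSize;
  }

  public static void main(String[] args) {
    // One combination of a room number, a date, and two times (offsets are made up).
    List<Span> combination = Arrays.asList(
        new Span(10, 14), new Span(20, 30), new Span(35, 42), new Span(46, 53));
    Span covering = coveringSpan(combination);
    System.out.println("Covering span: " + covering.begin + ".." + covering.end);
    System.out.println("Fits window of 200: " + fitsWindow(combination, 200));
  }
}
```

- <para>In the real annotator, a combination that passes this check (and the duplicate-span
- checks) would result in a Meeting annotation being created over the covering span.</para>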
-
- <para>The XML descriptor, located in
- <literal>examples/descriptors/tutorial/ex4/MeetingAnnotator.xml</literal>, is also very
- straightforward. An important difference from previous descriptors is that this is the first annotator
- we've discussed that has input requirements. This can be seen on the <quote>Capabilities</quote>
- page of the Component Descriptor Editor:</para>
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="JPG" fileref="&imgroot;image032.jpg"/>
- </imageobject>
- <textobject><phrase>Screen shot of Capabilities page of the Component Descriptor Editor
- </phrase></textobject>
- </mediaobject>
- </screenshot>
-
- <para>If we were to run the MeetingAnnotator on its own, it wouldn't detect anything because it
- wouldn't have any input annotations to work with. The required input annotations can be produced by the
- RoomNumber and DateTime annotators. So, we create an aggregate Analysis Engine containing these two
- annotators, followed by the Meeting annotator. This aggregate is illustrated in <xref
- linkend="ugr.tug.aae.fig.aggregate_for_meeting_annotator"/>. The descriptor for this is in
- <literal>examples/descriptors/tutorial/ex4/MeetingDetectorAE.xml</literal>. Give it a try in the
- Document Analyzer.
-
- <figure id="ugr.tug.aae.fig.aggregate_for_meeting_annotator">
- <title>An Aggregate Analysis Engine where an internal component uses output from previous
- engines</title>
- <mediaobject>
- <imageobject>
- <imagedata width="5.7in" format="PNG" fileref="&imgroot;image034.png"/>
- </imageobject>
- <textobject><phrase>An Aggregate Analysis Engine where an internal component uses output from
- previous engines. </phrase>
- </textobject>
- </mediaobject>
- </figure> </para>
-
- </section>
- </section>
-
- <section id="ugr.tug.aae.other_examples">
- <title>Other examples</title>
-
- <para>The UIMA SDK includes several other examples you may find interesting,
- including:</para>
-
- <itemizedlist spacing="compact">
- <listitem><para>SimpleTokenAndSentenceAnnotator – a simple tokenizer and
- sentence annotator.</para></listitem>
-
- <listitem><para>XmlDetagger – A multi-sofa annotator that does XML
- detagging. Multiple Sofas (Subjects of Analysis) are described in a later chapter;
- see <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.mvs"/>. Reads XML data from the input Sofa
- (named "xmlDocument"); this data can be stored in the CAS as a string or array, or it can
- be a URI to a remote file. The XML is parsed using the JVM's default parser, and the
- plain-text content is written to a new sofa called "plainTextDocument".</para>
- </listitem>
-
- <listitem><para>PersonTitleDBWriterCasConsumer – a sample CAS Consumer
- which populates a relational database with some annotations. It uses JDBC and, in this
- example, connects to the open-source Apache Derby database. </para></listitem>
- </itemizedlist>
- </section>
-
- <section id="ugr.tug.aae.additional_topics">
- <title>Additional Topics</title>
-
- <section id="ugr.tug.aae.contract_for_annotator_methods">
- <title>Contract: Annotator Methods Called by the Framework</title>
- <titleabbrev>Annotator Methods</titleabbrev>
-
- <para>The UIMA framework ensures that an Annotator instance is called by only one
- thread at a time. An instance never has to worry about running some method on one
- thread, and then asynchronously being called using another thread. This approach
- simplifies the design of annotators – they do not have to be designed to support
- multi-threading. When multi-threading is wanted for performance, multiple
- instances of the Annotator are created, each one running on just one thread.</para>
-
- <para>The following table defines the methods called by the framework, when they are
- called, and the requirements annotator implementations must follow.</para>
-
- <informaltable frame="all">
- <tgroup cols="3" colsep="1" rowsep="1">
- <colspec colname="c1"/>
- <colspec colname="c2"/>
- <colspec colname="c3"/>
- <thead>
- <row>
- <entry align="center">Method</entry>
- <entry align="center">When Called by Framework</entry>
- <entry align="center">Requirements</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>initialize</entry>
- <entry>Typically only called once, when instance is created. Can be called
- again if application does a reinitialize call and the default behavior
- isn't overridden (the default behavior for reinitialize is to call
- <literal>destroy</literal> followed by
- <literal>initialize</literal>).</entry>
- <entry>Normally does one-time initialization, including reading of
- configuration parameters. If the application changes the parameters, it
- can call initialize to have the annotator re-do its
- initialization.</entry>
- </row>
- <row>
- <entry>typeSystemInit</entry>
- <entry>Called before <literal>process</literal> whenever the type system
- in the CAS being passed in differs from what was previously passed in a
- <literal>process</literal> call (and called for the first CAS passed in,
- too). The Type System being passed to an annotator only changes in the case of
- remote annotators that are active as servers, receiving possibly
- different type systems to operate on.</entry>
- <entry>Typically, users of JCas do not implement any method for this. An
- annotator can use this call to read the CAS type system and set up any instance
- variables that make accessing the types and features convenient.</entry>
- </row>
- <row>
- <entry>process</entry>
- <entry>Called once for each CAS. If the application is not using the
- Collection Processing Manager (CPM), the application calls the process
- method on the analysis engine, and the framework delegates the call to
- all the annotators in the engine. In a Collection Processing application,
- the CPM calls the process method. If the application creates and manages
- its own Collection Processing Engine via API calls (see Javadocs), the
- application calls this on the Collection Processing Engine, and it is
- delegated by the framework to the components.</entry>
- <entry>Process the CAS, adding and/or modifying elements in it.</entry>
- </row>
- <row>
- <entry>destroy</entry>
- <entry>This method can be called by applications, and is also called by the
- Collection Processing Manager framework when the collection processing
- completes. If called by an application on the Engine object, it is
- propagated to all contained annotators.</entry>
- <entry>An annotator should release all resources, close files, close
- database connections, etc., and return to a state where another initialize
- call could be received to restart. Typically, after a destroy call, no
- further calls will be made to an annotator instance.</entry>
- </row>
- <row>
- <entry>reconfigure</entry>
- <entry><para>This method is never called by the framework, unless an
- application calls it on the Engine object – in which case the
- framework propagates it to all annotators contained in the Engine.</para>
- <para>Its purpose is to signal that the configuration parameters have
- changed.</para></entry>
- <entry>A default implementation of this calls destroy, followed by
- initialize. This is the only case where initialize would be called more than
- once. Users should implement whatever logic is needed to return the
- annotator to an initialized state, including re-reading the
- configuration parameter data.</entry>
- </row>
- </tbody>
- </tgroup>
- </informaltable>
-
- </section>
-
- <section id="ugr.tug.aae.reporting_errors_from_annotators">
- <title>Reporting errors from Annotators</title>
-
- <para>There are two broad classes of errors that can occur: recoverable and
- unrecoverable. Because Annotators are often expected to process very large numbers
- of artifacts (for example, text documents), they should be written to recover where
- possible.</para>
-
- <para>For example, if an upstream annotator created invalid input for an
- annotator, the annotator may want to log this event, ignore the bad input and
- continue. It may include a notification of this event in the CAS, for further
- downstream annotators to consider. Or, it may throw an exception (see next section)
- – but in this case, it cannot do any further processing on that
- document.</para> <note><para>The choice of what to do can be made configurable,
- using the configuration parameters. </para></note>
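-
- <para>A minimal sketch of such a configurable choice follows. It uses only standard Java so that
- it is self-contained: the <literal>OnBadInput</literal> enum, the <literal>parseAll</literal>
- method, and the use of <literal>IllegalStateException</literal> are assumptions made for this
- illustration. A real annotator would read the policy from a configuration parameter, log through
- the framework, and throw one of the UIMA exception classes described in the next section.</para>

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class BadInputPolicySketch {

  /** Hypothetical values for a configuration parameter that selects the policy. */
  enum OnBadInput { SKIP, FAIL }

  /** Parses each token; invalid tokens are skipped or cause a failure, per the policy. */
  static List<Integer> parseAll(List<String> tokens, OnBadInput policy) {
    List<Integer> result = new ArrayList<>();
    for (String token : tokens) {
      try {
        result.add(Integer.parseInt(token));
      } catch (NumberFormatException e) {
        if (policy == OnBadInput.FAIL) {
          // Treated as unrecoverable: no further processing of this document.
          throw new IllegalStateException("Invalid input: " + token, e);
        }
        // Treated as recoverable: note the event and continue with the next token.
        System.err.println("Skipping invalid input: " + token);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    List<String> tokens = Arrays.asList("12", "x7", "30");
    System.out.println(parseAll(tokens, OnBadInput.SKIP));
  }
}
```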
-
- </section>
-
- <section id="ugr.tug.aae.throwing_exceptions_from_annotators">
- <title>Throwing Exceptions from Annotators</title>
-
- <para>Let's say an invalid regular expression was passed as a parameter to the
- RoomNumberAnnotator. Because this is an error related to the overall
- configuration, and not something we could expect to ignore, we should throw an
- appropriate exception, and most Java programmers would expect to do so like
- this:</para>
-
-
- <programlisting>throw new ResourceInitializationException(
- "The regular expression " + x + " is not valid.");</programlisting>
-
- <para>UIMA, however, does not do it this way. All UIMA exceptions are
- <emphasis>internationalized</emphasis>, meaning that they support translation
- into other languages. This is accomplished by eliminating hardcoded message
- strings and instead using external message digests. Message digests are files
- containing (key, value) pairs. The key is used in the Java code instead of the actual
- message string. This allows the message string to be easily translated later by
- modifying the message digest file, not the Java code. Also, message strings in the
- digest can contain parameters that are filled in when the exception is thrown. The
- format of the message digest file is described in the Javadocs for the Java class
- <literal>java.util.PropertyResourceBundle</literal> and in the load method of
- <literal>java.util.Properties</literal>.</para>
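-
- <para>To make the (key, value) idea concrete, here is a small standalone sketch using only
- standard Java classes. The digest content, the key <literal>regex_syntax_error</literal> as used
- here, and the <literal>{0}</literal> placeholder syntax of
- <literal>java.text.MessageFormat</literal> are assumptions made for illustration; this is not
- the framework's own lookup code.</para>

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.text.MessageFormat;
import java.util.PropertyResourceBundle;
import java.util.ResourceBundle;

public class MessageDigestSketch {

  // Hypothetical digest content; normally this would live in a .properties file.
  static final String DIGEST =
      "regex_syntax_error = The regular expression \"{0}\" is not valid.\n";

  /** Looks up the message for a key and fills in its parameters. */
  static String format(String key, Object... params) {
    try {
      ResourceBundle bundle = new PropertyResourceBundle(new StringReader(DIGEST));
      return MessageFormat.format(bundle.getString(key), params);
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    // The Java code refers only to the key; the wording lives in the digest.
    System.out.println(format("regex_syntax_error", "[0-9"));
  }
}
```

- <para>Translating the message then means supplying a different digest for another locale,
- while the Java code that refers to the key stays unchanged.</para>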
-
- <para>The first thing an annotator developer must choose is which Exception class to
- use. There are three to choose from:
-
- <orderedlist><listitem><para>ResourceConfigurationException should be
- thrown from the annotator's initialize() method if invalid configuration
- parameter values have been specified.</para></listitem>
-
- <listitem><para>ResourceInitializationException should be thrown from the
- annotator's initialize() method if initialization fails for some other
- reason.</para></listitem>
-
- <listitem><para>AnalysisEngineProcessException should be thrown from the
- annotator's process() method if the processing of a particular document
- fails for any reason. </para></listitem></orderedlist></para>
-
- <para>Generally you will not need to define your own custom exception classes, but if
- you do they must extend one of these three classes, which are the only types of
- Exceptions that the annotator interface permits annotators to throw.</para>
-
- <para>All of the UIMA Exception classes share common constructor varieties. There are
- four possible arguments:</para>
-
- <itemizedlist>
- <listitem><para>The name of the message digest to use (optional; if not specified, the
- default UIMA message digest is used).</para></listitem>
-
- <listitem><para>The key string used to select the message in the message digest.</para></listitem>
-
- <listitem><para>An object array containing the parameters to include in the message. Messages
- can have substitutable parts. When the message is produced, the string representations
- of the objects passed are substituted into the message. The object array is often
- created using the syntax <literal>new Object[]{x, y}</literal>.</para></listitem>
-
- <listitem><para>Another exception that is the <quote>cause</quote> of the exception you are
- throwing (optional). This feature is commonly used when you catch another exception and
- rethrow it.</para></listitem>
- </itemizedlist>
-
- <para>If you look at the source file (folder: src in Eclipse)
- <literal>org.apache.uima.tutorial.ex5.RoomNumberAnnotator</literal>, you
- will see the following code:
-
-
- <programlisting>try {
- mPatterns[i] = Pattern.compile(patternStrings[i]);
-}
-catch (PatternSyntaxException e) {
- throw new ResourceInitializationException(
- MESSAGE_DIGEST, "regex_syntax_error",
- new Object[]{patternStrings[i]}, e);
-}</programlisting>
- where the MESSAGE_DIGEST constant has the value <literal>
- "org.apache.uima.tutorial.ex5.RoomNumberAnnotator_Messages". </literal>
- </para>
-
[... 3605 lines stripped ...]