Posted to commits@uima.apache.org by sc...@apache.org on 2010/05/06 16:06:04 UTC
svn commit: r941744 [2/7] - in
/uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides: ./
src/ src/docbook/ src/docbook/images/
src/docbook/images/tutorials_and_users_guides/
src/docbook/images/tutorials_and_users_guides/tug.aae/ src/d...
Added: uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides/src/docbook/annotator_analysis_engine_guide.xml
URL: http://svn.apache.org/viewvc/uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides/src/docbook/annotator_analysis_engine_guide.xml?rev=941744&view=auto
==============================================================================
--- uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides/src/docbook/annotator_analysis_engine_guide.xml (added)
+++ uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides/src/docbook/annotator_analysis_engine_guide.xml Thu May 6 14:06:02 2010
@@ -0,0 +1,2607 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+<!ENTITY imgroot "images/tutorials_and_users_guides/tug.aae/">
+<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent">
+%uimaents;
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<chapter id="ugr.tug.aae">
+ <title>Annotator and Analysis Engine Developer's Guide</title>
+ <titleabbrev>Annotator &amp; AE Developer's Guide</titleabbrev>
+
+ <para>This chapter describes how to develop UIMA <emphasis>type systems</emphasis>,
+ <emphasis>Annotators</emphasis> and <emphasis>Analysis Engines</emphasis> using
+ the UIMA SDK. Reading the UIMA Conceptual Overview chapter first is helpful as a review of
+ these concepts.</para>
+
+ <para>An <emphasis>Analysis Engine (AE)</emphasis> is a program that analyzes artifacts
+ (e.g. documents) and infers information from them.</para>
+
+ <para>Analysis Engines are constructed from building blocks called
+ <emphasis>Annotators</emphasis>. An annotator is a component that contains analysis
+ logic. Annotators analyze an artifact (for example, a text document) and create
+ additional data (metadata) about that artifact. It is a goal of UIMA that annotators need
+ not be concerned with anything other than their analysis logic – for example the
+ details of their deployment or their interaction with other annotators.</para>
+
+ <para>An Analysis Engine (AE) may contain a single annotator (this is referred to as a
+ <emphasis>Primitive AE</emphasis>), or it may be a composition of others and therefore
+ contain multiple annotators (this is referred to as an <emphasis>Aggregate
+ AE</emphasis>). Primitive and aggregate AEs implement the same interface and can be used
+ interchangeably by applications.</para>
+
+ <para>Annotators produce their analysis results in the form of typed <emphasis>Feature
+ Structures</emphasis>, which are simply data structures that have a type and a set of
+ (attribute, value) pairs. An <emphasis>annotation</emphasis> is a particular type of
+ Feature Structure that is attached to a region of the artifact being analyzed (a span of
+ text in a document, for example).</para>
+
+ <para>For example, an annotator may produce an Annotation over the span of text
+ <literal>President Bush</literal>, where the type of the Annotation is
+ <literal>Person</literal> and the attribute <literal>fullName</literal> has the
+ value <literal>George W. Bush</literal>, and its position in the artifact is character
+ position 12 through character position 26.</para>
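+ <para>As a rough illustration of this example, the following plain-Java sketch models an
+ annotation as a type name, a character span, and a set of (attribute, value) pairs. The class
+ <literal>AnnotationSketch</literal> and the sample sentence are made up for this sketch and are
+ not part of the UIMA API. The sketch also assumes that the end offset points just past the last
+ character of the span.</para>

```java
import java.util.LinkedHashMap;
import java.util.Map;

/**
 * Hypothetical stand-in for an annotation: a type name, a character span,
 * and a set of (attribute, value) pairs. This is NOT the UIMA CAS API.
 */
public class AnnotationSketch {
    final String type;
    final int begin;  // offset of the first character of the span
    final int end;    // assumed: offset just past the last character of the span
    final Map<String, String> features = new LinkedHashMap<>();

    AnnotationSketch(String type, int begin, int end) {
        this.type = type;
        this.begin = begin;
        this.end = end;
    }

    /** Returns the part of the document text that this annotation covers. */
    String coveredText(String documentText) {
        return documentText.substring(begin, end);
    }

    /** Builds the Person annotation used as the running example. */
    static AnnotationSketch example() {
        AnnotationSketch person = new AnnotationSketch("Person", 12, 26);
        person.features.put("fullName", "George W. Bush");
        return person;
    }

    public static void main(String[] args) {
        // Made-up sample text in which "President Bush" starts at offset 12.
        String documentText = "In the news President Bush spoke today.";
        AnnotationSketch person = example();
        System.out.println(person.type + " [" + person.begin + ", " + person.end + ") "
            + person.coveredText(documentText) + " " + person.features);
    }
}
```

+ <para>In UIMA itself this information lives in the CAS. The sketch only shows how the span and
+ the feature value relate to the text.</para>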
+
+ <para>It is also possible for annotators to record information associated with the entire
+ document rather than a particular span (these are considered Feature Structures but not
+ Annotations).</para>
+
+ <para>All feature structures, including annotations, are represented in the UIMA
+ <emphasis>Common Analysis Structure (CAS)</emphasis>. The CAS is the central data
+ structure through which all UIMA components communicate. Included with the UIMA SDK is an
+ easy-to-use, native Java interface to the CAS called the <emphasis>JCas</emphasis>.
+ The JCas represents each feature structure as a Java object; the example feature
+ structure from the previous paragraph would be an instance of a Java class Person with
+ getFullName() and setFullName() methods. Though the examples in this guide all use the
+ JCas, it is also possible to directly access the underlying CAS system; for more
+ information see <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>
+
+ <para>The remainder of this chapter will refer to the analysis of text documents and the
+ creation of annotations that are attached to spans of text in those documents. Keep in mind
+ that the CAS can represent arbitrary types of feature structures, and feature structures
+ can refer to other feature structures. For example, you can use the CAS to represent a parse
+ tree for a document. Also, the artifact that you are analyzing need not be a text
+ document.</para>
+
+ <para>This guide is organized as follows:</para>
+
+ <itemizedlist>
+ <listitem>
+ <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.getting_started"/></emphasis> is a
+ tutorial with step-by-step instructions for how to develop and test a simple UIMA annotator.</para>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.configuration_logging"/>
+ </emphasis> discusses how to make your UIMA annotator configurable, and how it can write messages to the UIMA
+ log file.</para>
+ </listitem>
+ <listitem>
+ <para> <emphasis role="bold-italic"><xref linkend="ugr.tug.aae.building_aggregates"/></emphasis>
+ describes how annotators can be combined into aggregate analysis engines. It also describes how one
+ annotator can make use of the analysis results produced by an annotator that has run previously.</para>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.other_examples"/></emphasis>
+ describes several other examples you may find interesting, including</para>
+
+ <itemizedlist spacing="compact">
+ <listitem>
+ <para>SimpleTokenAndSentenceAnnotator
+ – a simple tokenizer and sentence annotator.</para>
+ </listitem>
+
+ <listitem>
+ <para>PersonTitleDBWriterCasConsumer – a sample CAS Consumer that populates a relational
+ database with some annotations. It uses JDBC and, in this example, connects to the open source Apache
+ Derby database. </para>
+ </listitem>
+ </itemizedlist>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.additional_topics"/></emphasis>
+ describes additional features of the UIMA SDK that may help you in building your own annotators and analysis
+ engines.</para>
+ </listitem>
+ <listitem>
+ <para><emphasis role="bold-italic"><xref linkend="ugr.tug.aae.common_pitfalls"/> </emphasis>
+ contains some useful guidelines to help you ensure that your annotators will work correctly in any UIMA
+ application.</para>
+ </listitem>
+ </itemizedlist>
+
+ <para>This guide does not discuss how to build UIMA Applications, which are programs that
+ use Analysis Engines, along with other components, e.g. a search engine, document store,
+ and user interface, to deliver a complete package of functionality to an end-user. For
+ information on application development, see <olink
+ targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.application"
+ xrefstyle="select: label quotedtitle"/>.</para>
+
+ <section id="ugr.tug.aae.getting_started">
+ <title>Getting Started</title>
+
+ <para>This section is a step-by-step tutorial that will get you started developing UIMA
+ annotators. All of the files referred to by the examples in this chapter are in the
+ <literal>examples</literal> directory of the UIMA SDK. This directory is designed to
+ be imported into your Eclipse workspace; see <olink
+ targetdoc="&uima_docs_overview;"
+ targetptr="ugr.ovv.eclipse_setup.example_code"/> for instructions on how to do
+ this.
+ See <olink targetdoc="&uima_docs_overview;"
+ targetptr="ugr.ovv.eclipse_setup.linking_uima_javadocs"/> for how to attach the UIMA
+ Javadocs to the jar files.
+ Also you may wish to refer to the UIMA SDK Javadocs located in the <ulink
+ url="file:../../api/index.html">docs/api</ulink> directory.</para>
+
+ <note><para>In Eclipse 3.1, if you highlight a UIMA class or method defined in the UIMA SDK
+ Javadocs, you can conveniently have Eclipse open the corresponding Javadoc for that
+ class or method in a browser, by pressing Shift + F2.</para></note>
+ <note><para>If you downloaded the source distribution for UIMA, you can attach that as
+ well to the library Jar files; for information on how to do this, see
+ <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.javadocs"/>.</para></note>
+
+ <para>The example annotator that we are going to walk through will detect room numbers for
+ rooms where the room numbering scheme follows some simple conventions. In our example,
+ there are two kinds of patterns we want to find; here are some examples, together with
+ their corresponding regular expression patterns:
+ <variablelist>
+ <varlistentry>
+ <term>Yorktown patterns:</term>
+ <listitem><para>20-001, 31-206, 04-123 (Regular Expression Pattern:
+ ##-[0-2]##, where # stands for a digit)</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>Hawthorne patterns:</term>
+ <listitem><para>GN-K35, 1S-L07, 4N-B21 (Regular Expression Pattern:
+ [G1-4][NS]-[A-Z]##)</para></listitem>
+ </varlistentry>
+ </variablelist> </para>
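+ <para>These conventions can be tried out with the <literal>java.util.regex</literal> package
+ before any UIMA code is written. The sketch below is a standalone check and not part of the
+ tutorial sources. The class name <literal>RoomPatternSketch</literal> and the non-matching sample
+ <literal>99-999</literal> are assumptions made for illustration, while the two expressions are the
+ ones that the example annotator's source uses. Note that the Yorktown expression restricts the
+ first digit to the range 0 to 4.</para>

```java
import java.util.regex.Pattern;

/** Checks candidate strings against the two room-number conventions. */
public class RoomPatternSketch {
    // Yorktown convention, e.g. 20-001; the first digit is limited to 0-4.
    static final Pattern YORKTOWN = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
    // Hawthorne convention, e.g. GN-K35.
    static final Pattern HAWTHORNE = Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");

    /** Returns the building name for a room number, or null if neither pattern matches. */
    static String classify(String candidate) {
        if (YORKTOWN.matcher(candidate).matches()) {
            return "Yorktown";
        }
        if (HAWTHORNE.matcher(candidate).matches()) {
            return "Hawthorne";
        }
        return null;
    }

    public static void main(String[] args) {
        String[] samples = {"20-001", "31-206", "04-123", "GN-K35", "1S-L07", "4N-B21", "99-999"};
        for (String sample : samples) {
            System.out.println(sample + " : " + classify(sample));
        }
    }
}
```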
+
+ <para>There are several steps to develop and test a simple UIMA annotator.</para>
+
+ <orderedlist spacing="compact"><listitem><para>Define the CAS types that the
+ annotator will use.</para></listitem>
+
+ <listitem><para>Generate the Java classes for these types.</para></listitem>
+
+ <listitem><para>Write the actual annotator Java code.</para></listitem>
+
+ <listitem><para>Create the Analysis Engine descriptor.</para></listitem>
+
+ <listitem><para>Test the annotator. </para></listitem></orderedlist>
+
+ <para>These steps are discussed in the next sections.</para>
+
+ <section id="ugr.tug.aae.defining_types">
+ <title>Defining Types</title>
+
+ <para>The first step in developing an annotator is to define the CAS Feature Structure
+ types that it creates. This is done in an XML file called a <emphasis>Type System
+ Descriptor</emphasis>. UIMA defines basic primitive types such as
+ Boolean, Byte, Short, Integer, Long, Float, and Double, as well as Arrays of these primitive
+ types. UIMA also defines the built-in types <literal>TOP</literal>, which is the root
+ of the type system, analogous to Object in Java; <literal>FSArray</literal>, which is
+ an array of Feature Structures (i.e. an array of instances of TOP); and
+ <literal>Annotation</literal>, which we will discuss in more detail in this section.</para>
+
+ <para>UIMA includes an Eclipse plug-in that will help you edit Type System
+ Descriptors, so if you are using Eclipse you will not need to worry about the details of
+ the XML syntax. See <olink targetdoc="&uima_docs_overview;"
+ targetptr="ugr.ovv.eclipse_setup"/> for instructions on setting up Eclipse and
+ installing the plugin.</para>
+
+ <para>The Type System Descriptor for our annotator is located in the file
+ <literal>descriptors/tutorial/ex1/TutorialTypeSystem.xml</literal>. (This
+ and all other examples are located in the <literal>examples</literal> directory of
+ the installation of the UIMA SDK, which can be imported into an Eclipse project for
+ your convenience, as described in <olink targetdoc="&uima_docs_overview;"
+ targetptr="ugr.ovv.eclipse_setup.example_code"/>.)</para>
+
+ <para>In Eclipse, expand the <literal>uimaj-examples</literal> project in the
+ Package Explorer view, and browse to the file
+ <literal>descriptors/tutorial/ex1/TutorialTypeSystem.xml</literal>.
+ Right-click on the file in the navigator and select Open With → Component
+ Descriptor Editor. Once the editor opens, click on the <quote>Type System</quote>
+ tab at the bottom of the editor window. You should see a view such as the
+ following:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="100" format="JPG" fileref="&imgroot;image002.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of editor for Type System Definitions</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>Our annotator will need only one type –
+ <literal>org.apache.uima.tutorial.RoomNumber</literal>. (We use the same
+ namespace conventions as are used for Java classes.) Just as in Java, types have
+ supertypes. The supertype is listed in the second column of the left table. In this
+ case our RoomNumber annotation extends from the built-in type
+ <literal>uima.tcas.Annotation</literal>.</para>
+
+ <para>Descriptions can be included with types and features. In this example, there is a
+ description associated with the <literal>building</literal> feature. To see it,
+ hover the mouse over the feature.</para>
+
+ <para>The bottom tab labeled <quote>Source</quote> will show you the XML source file
+ associated with this descriptor.</para>
+
+ <para>The built-in Annotation type declares three fields (called
+ <emphasis>Features</emphasis> in CAS terminology). The features <literal>begin</literal>
+ and <literal>end</literal> store the character offsets of the span of text to which the
+ annotation refers. The feature <literal>sofa</literal> (Subject of Analysis) indicates
+ which document the begin and end offsets point into. The <literal>sofa</literal> feature
+ can be ignored for now since we assume in this tutorial that the CAS contains only one
+ subject of analysis (document).</para>
+ <para>Our RoomNumber type will inherit these three features from
+ <literal>uima.tcas.Annotation</literal>, its supertype; they are not visible in
+ this view because inherited features are not shown. One additional feature,
+ <literal>building</literal>, is declared. It takes a String as its value. Instead
+ of String, we could have declared the range-type of our feature to be any other CAS type
+ (defined or built-in).</para>
+
+ <para>If you are not using Eclipse and need to edit the type system, you can do so directly
+ with any XML or text editor. The following is the actual XML representation of the Type
+ System displayed above in the editor:</para>
+
+
+ <programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8" ?>
+ <typeSystemDescription xmlns="http://uima.apache.org/resourceSpecifier">
+   <name>TutorialTypeSystem</name>
+   <description>Type System Definition for the tutorial examples -
+     as of Exercise 1</description>
+   <vendor>Apache Software Foundation</vendor>
+   <version>1.0</version>
+   <types>
+     <typeDescription>
+       <name>org.apache.uima.tutorial.RoomNumber</name>
+       <description></description>
+       <supertypeName>uima.tcas.Annotation</supertypeName>
+       <features>
+         <featureDescription>
+           <name>building</name>
+           <description>Building containing this room</description>
+           <rangeTypeName>uima.cas.String</rangeTypeName>
+         </featureDescription>
+       </features>
+     </typeDescription>
+   </types>
+ </typeSystemDescription>]]></programlisting>
+
+ </section>
+
+ <section id="ugr.tug.aae.generating_jcas_sources">
+ <title>Generating Java Source Files for CAS Types</title>
+
+ <para>When you save a descriptor that you have modified, the Component Descriptor
+ Editor will automatically generate Java classes corresponding to the types that are
+ defined in that descriptor (unless this has been disabled), using a utility called
+ JCasGen. These Java classes will have the same name (including package) as the CAS
+ types, and will have get and set methods for each of the features that you have
+ defined.</para>
+
+ <para>This feature is enabled/disabled using the UIMA menu pulldown (or the Eclipse
+ Preferences → UIMA). If automatic running of JCasGen is not happening, please
+ make sure the option is checked:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image004.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of enabling automatic running of JCasGen</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>The Java class for the example org.apache.uima.tutorial.RoomNumber type can
+ be found in <literal>src/org/apache/uima/tutorial/RoomNumber.java</literal>.
+ You will see how to use these generated classes in the next section.</para>
+
+ <para>If you are not using the Component Descriptor Editor, you will need to generate
+ these Java classes by using the <emphasis>JCasGen</emphasis> tool. JCasGen reads a
+ Type System Descriptor XML file and generates the corresponding Java classes that
+ you can then use in your annotator code. To launch JCasGen, run the jcasgen shell
+ script located in the <literal>/bin</literal> directory of the UIMA SDK
+ installation. This should launch a GUI that looks something like this:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image006.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of JCasGen</phrase></textobject>
+ </mediaobject>
+</screenshot>
+
+ <para>Use the <quote>Browse</quote> buttons to select your input file
+ (TutorialTypeSystem.xml) and output directory (the root of the source tree into
+ which you want the generated files placed). Then click the <quote>Go</quote>
+ button. If the Type System Descriptor has no errors, new Java source files will be
+ generated under the specified output directory.</para>
+
+ <para>There are some additional options to choose from when running JCasGen; please
+ refer to the <olink targetdoc="&uima_docs_tools;"
+ targetptr="ugr.tools.jcasgen"/> for details.</para>
+ </section>
+
+ <section id="ugr.tug.aae.developing_annotator_code">
+ <title>Developing Your Annotator Code</title>
+
+ <para>Annotator implementations all implement a standard interface (AnalysisComponent), having several
+ methods, the most important of which are:
+
+ <itemizedlist spacing="compact">
+ <listitem>
+ <para><literal>initialize</literal>, </para>
+ </listitem>
+
+ <listitem>
+ <para><literal>process</literal>, and </para>
+ </listitem>
+
+ <listitem>
+ <para><literal>destroy</literal>. </para>
+ </listitem>
+ </itemizedlist></para>
+
+ <para><literal>initialize</literal> is called by the framework once when it first creates an instance of the
+ annotator class. <literal>process</literal> is called once per item being processed.
+ <literal>destroy</literal> may be called by the application when it is done using your annotator. There is a
+ default implementation of this interface for annotators using the JCas, called JCasAnnotator_ImplBase, which
+ has implementations of all required methods except for the process method.</para>
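+ <para>The order of these calls can be pictured with a small standalone sketch. The
+ <literal>Component</literal> interface and the <literal>run</literal> driver below are
+ hypothetical stand-ins written for this illustration. They are not the UIMA
+ <literal>AnalysisComponent</literal> API, whose methods take different arguments, and they only
+ mirror the sequence described above.</para>

```java
/**
 * Hypothetical, simplified lifecycle. The interface and driver below are
 * stand-ins for illustration and are NOT the UIMA AnalysisComponent API.
 */
public class LifecycleSketch {
    interface Component {
        void initialize();
        void process(String document);
        void destroy();
    }

    /** Records each lifecycle call so that the order can be inspected. */
    static class RecordingComponent implements Component {
        final StringBuilder calls = new StringBuilder();

        public void initialize() {
            calls.append("initialize;");
        }

        public void process(String document) {
            calls.append("process(").append(document).append(");");
        }

        public void destroy() {
            calls.append("destroy;");
        }
    }

    /** Plays the role of the framework and application driving one component. */
    static void run(Component component, String[] documents) {
        component.initialize();               // once, after the instance is created
        try {
            for (String document : documents) {
                component.process(document);  // once per item being processed
            }
        } finally {
            component.destroy();              // when the caller is done with it
        }
    }

    public static void main(String[] args) {
        RecordingComponent component = new RecordingComponent();
        run(component, new String[] {"doc1", "doc2"});
        System.out.println(component.calls);
    }
}
```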
+
+ <para>Our annotator class extends the JCasAnnotator_ImplBase; most annotators that use the JCas will extend
+ from this class, so they only have to implement the process method. This class is not restricted to handling
+ just text; see <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>.</para>
+
+ <para>Annotators are not required to extend from the JCasAnnotator_ImplBase class; they may instead
+ directly implement the AnalysisComponent interface, and provide all method implementations themselves.
+ <footnote>
+ <para>Note that AnalysisComponent is not specific to JCas. There is a method getRequiredCasInterface()
+ which the user would have to implement to return <literal>JCas.class</literal>. Then in the
+ <literal>process(AbstractCas cas)</literal> method, they would need to typecast
+ <literal>cas</literal> to type <literal>JCas</literal>.</para></footnote> This allows you to have
+ your annotator inherit from some other superclass if necessary. If you would like to do this, see the Javadocs
+ for JCasAnnotator for descriptions of the methods you must implement.</para>
+
+ <para>Annotator classes need to be public, cannot be declared abstract, and must have public, 0-argument
+ constructors, so that they can be instantiated by the framework. <footnote>
+ <para> Although Java classes in which you do not define any constructor will, by default, have a 0-argument
+ constructor that doesn't do anything, a class in which you have defined at least one constructor does
+ not get a default 0-argument constructor.</para> </footnote></para>
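+ <para>One way to see why these rules matter is to create classes by name using Java reflection.
+ Reflection is assumed here as the instantiation mechanism purely for illustration; the sketch
+ does not show how the UIMA framework itself creates annotators, and all class names in it are
+ hypothetical.</para>

```java
/**
 * Shows which classes can be created by name with no arguments.
 * Reflection is an assumed mechanism for this sketch only.
 */
public class ConstructorSketch {
    /** No constructor declared, so the compiler supplies a public 0-argument one. */
    public static class ImplicitDefault {
    }

    /** Declares only a 1-argument constructor, so no 0-argument one exists. */
    public static class OnlyOneArg {
        public OnlyOneArg(String name) {
        }
    }

    /** Has a 0-argument constructor but is abstract, so it cannot be instantiated. */
    public abstract static class AbstractOne {
    }

    /** Returns true if the named class can be created through a public 0-argument constructor. */
    static boolean canInstantiate(String className) {
        try {
            Class.forName(className).getConstructor().newInstance();
            return true;
        } catch (ReflectiveOperationException e) {
            // Includes NoSuchMethodException (no such constructor) and
            // InstantiationException (abstract class).
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canInstantiate(ImplicitDefault.class.getName()));
        System.out.println(canInstantiate(OnlyOneArg.class.getName()));
        System.out.println(canInstantiate(AbstractOne.class.getName()));
    }
}
```

+ <para>Only the first class can be created this way. The second has no 0-argument constructor,
+ and the third is abstract.</para>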
+
+ <para>The class definition for our RoomNumberAnnotator implements the process method, and is shown here. You
+ can find the source for this in
+ <literal>uimaj-examples/src/org/apache/uima/tutorial/ex1/RoomNumberAnnotator.java</literal>.
+ <note>
+ <para>In Eclipse, in the <quote>Package Explorer</quote> view, this will appear by default in the project
+ <literal>uimaj-examples</literal>, in the folder <literal>src</literal>, in the package
+ <literal>org.apache.uima.tutorial.ex1</literal>.</para></note> In Eclipse, open the
+ RoomNumberAnnotator.java in the uimaj-examples project, under the src directory.</para>
+
+
+ <programlisting>package org.apache.uima.tutorial.ex1;
+
+import java.util.regex.Matcher;
+import java.util.regex.Pattern;
+
+import org.apache.uima.analysis_component.JCasAnnotator_ImplBase;
+import org.apache.uima.jcas.JCas;
+import org.apache.uima.tutorial.RoomNumber;
+
+/**
+ * Example annotator that detects room numbers using
+ * Java 1.4 regular expressions.
+ */
+public class RoomNumberAnnotator extends JCasAnnotator_ImplBase {
+  private Pattern mYorktownPattern =
+      Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
+
+  private Pattern mHawthornePattern =
+      Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");
+
+  public void process(JCas aJCas) {
+    // Discussed Later
+  }
+}</programlisting>
+
+ <para>The two Java class fields, mYorktownPattern and mHawthornePattern, hold regular expressions that
+ will be used in the process method. Note that these two fields are part of the Java implementation of the
+ annotator code, and not a part of the CAS type system. We are using the regular expression facility that is
+ built into Java 1.4. It is not critical that you know the details of how this works, but if you are curious the
+ details can be found in the Java API docs for the java.util.regex package.</para>
+
+ <para>The only method that we are required to implement is <literal>process</literal>. This method is typically
+ called once for each document that is being analyzed. This method takes one argument, which is a JCas instance;
+ this holds the document to be analyzed and all of the analysis results. <footnote>
+ <para>Version 1 of UIMA specified an additional parameter, the ResultSpecification. This provides a
+ specification of which types and features are desired to be computed and "output" from this annotator. Its
+ use is optional; many annotators ignore it.</para>
+ <para> This parameter has been replaced by specific set/getResultSpecification() methods, which allow
+ the annotator to receive a signal (a method call) when the result specification changes.</para>
+ </footnote></para>
+
+
+ <programlisting>public void process(JCas aJCas) {
+  // get document text
+  String docText = aJCas.getDocumentText();
+  // search for Yorktown room numbers
+  Matcher matcher = mYorktownPattern.matcher(docText);
+  int pos = 0;
+  while (matcher.find(pos)) {
+    // found one - create annotation
+    RoomNumber annotation = new RoomNumber(aJCas);
+    annotation.setBegin(matcher.start());
+    annotation.setEnd(matcher.end());
+    annotation.setBuilding("Yorktown");
+    annotation.addToIndexes();
+    pos = matcher.end();
+  }
+  // search for Hawthorne room numbers
+  matcher = mHawthornePattern.matcher(docText);
+  pos = 0;
+  while (matcher.find(pos)) {
+    // found one - create annotation
+    RoomNumber annotation = new RoomNumber(aJCas);
+    annotation.setBegin(matcher.start());
+    annotation.setEnd(matcher.end());
+    annotation.setBuilding("Hawthorne");
+    annotation.addToIndexes();
+    pos = matcher.end();
+  }
+}</programlisting>
+
+ <para>The Matcher class is part of the java.util.regex package and is used to find the room numbers in the
+ document text. When we find one, recording the annotation is as simple as creating a new Java object and
+ calling some set methods:</para>
+
+
+ <programlisting>RoomNumber annotation = new RoomNumber(aJCas);
+annotation.setBegin(matcher.start());
+annotation.setEnd(matcher.end());
+annotation.setBuilding("Yorktown");</programlisting>
+
+ <para>The <literal>RoomNumber</literal> class was generated from the type system description by the
+ Component Descriptor Editor or the JCasGen tool, as discussed in the previous section.</para>
+
+ <para>Finally, we call <literal>annotation.addToIndexes()</literal> to add the new annotation to the
+ indexes maintained in the CAS. By default, the CAS implementation used for analysis of text documents keeps
+ an index of all annotations in their order from beginning to end of the document. Subsequent annotators or
+ applications use the indexes to iterate over the annotations. </para>
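+ <para>To experiment with the matching loop without a UIMA installation, the same logic can be
+ run against a plain string. In the sketch below, the <literal>Span</literal> class is a
+ hypothetical stand-in for the generated <literal>RoomNumber</literal> class, the sample sentence
+ is made up, and a list sorted by begin offset stands in for the annotation index. Sorting on the
+ begin offset alone is a simplification of the ordering that the CAS maintains.</para>

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Plain-Java walk-through of the matching loop, with no UIMA dependency.
 * The Span class is a hypothetical stand-in for the RoomNumber annotation.
 */
public class RoomSpanSketch {
    static final Pattern YORKTOWN = Pattern.compile("\\b[0-4]\\d-[0-2]\\d\\d\\b");
    static final Pattern HAWTHORNE = Pattern.compile("\\b[G1-4][NS]-[A-Z]\\d\\d\\b");

    static class Span {
        final int begin;
        final int end;
        final String building;

        Span(int begin, int end, String building) {
            this.begin = begin;
            this.end = end;
            this.building = building;
        }
    }

    /** Appends one Span per match, resuming each search where the last match ended. */
    static void collect(Pattern pattern, String building, String text, List<Span> out) {
        Matcher matcher = pattern.matcher(text);
        int pos = 0;
        while (matcher.find(pos)) {
            out.add(new Span(matcher.start(), matcher.end(), building));
            pos = matcher.end();
        }
    }

    /** Finds all room numbers and orders them by begin offset. */
    static List<Span> findRooms(String text) {
        List<Span> spans = new ArrayList<>();
        collect(YORKTOWN, "Yorktown", text, spans);
        collect(HAWTHORNE, "Hawthorne", text, spans);
        // Simplified stand-in for the annotation index: sort by begin offset only.
        spans.sort((a, b) -> Integer.compare(a.begin, b.begin));
        return spans;
    }

    public static void main(String[] args) {
        String text = "Meet in GN-K35, then move to 20-001.";
        for (Span span : findRooms(text)) {
            System.out.println(span.begin + "-" + span.end + " " + span.building
                + " " + text.substring(span.begin, span.end));
        }
    }
}
```

+ <para>Although the Yorktown matches are collected first, the Hawthorne room appears first in the
+ result because it occurs earlier in the text. This mirrors how consumers of an ordered index see
+ annotations in document order, regardless of the order in which they were added.</para>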
+
+ <note>
+ <para>If you don't add the instance to the indexes, downstream annotators cannot retrieve it
+ through the indexes.</para></note>
+
+ <note>
+ <para>You can also call <literal>addToIndexes()</literal> on Feature Structures that are not subtypes of
+ <literal>uima.tcas.Annotation</literal>, but these will not be sorted in any particular way. If you want
+ to specify a sort order, you can define your own custom indexes in the CAS: see <olink
+ targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/> and <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor.aes.index"/> for details.</para></note>
+
+ <para>We're almost ready to test the RoomNumberAnnotator. There is just one more step
+ remaining.</para>
+ </section>
+ <section id="ugr.tug.aae.creating_xml_descriptor">
+ <title>Creating the XML Descriptor</title>
+
+ <para>The UIMA architecture requires that descriptive information about an
+ annotator be represented in an XML file and provided along with the annotator class
+ file(s) to the UIMA framework at run time. This XML file is called an
+ <emphasis>Analysis Engine Descriptor</emphasis>. The descriptor includes:
+
+ <itemizedlist><listitem><para>Name, description, version, and vendor</para>
+ </listitem>
+
+ <listitem><para>The annotator's inputs and outputs, defined in terms of
+ the types in a Type System Descriptor</para></listitem>
+
+ <listitem><para>Declaration of the configuration parameters that the
+ annotator accepts </para></listitem></itemizedlist> </para>
+
+ <para>The <emphasis>Component Descriptor Editor</emphasis> plugin, which we
+ previously used to edit the Type System descriptor, can also be used to edit Analysis
+ Engine Descriptors.</para>
+
+ <para>A descriptor for our RoomNumberAnnotator is provided with the UIMA
+ distribution under the name
+ <literal>descriptors/tutorial/ex1/RoomNumberAnnotator.xml</literal>. To
+ edit it in Eclipse, right-click on that file in the navigator and select Open With
+ → Component Descriptor Editor.</para> <tip><para>In Eclipse, you can double
+ click on the tab at the top of the Component Descriptor Editor's window
+ identifying the currently selected editor, and the window will
+ <quote>Maximize</quote>. Double click it again to restore the original size.</para>
+ </tip>
+
+ <para>If you are not using Eclipse, you will need to edit Analysis Engine descriptors
+ manually. See <xref linkend="ugr.tug.aae.xml_intro_ae_descriptor"/> for an
+ introduction to the Analysis Engine descriptor XML syntax. The remainder of this
+ section assumes you are using the Component Descriptor Editor plug-in to edit the
+ Analysis Engine descriptor.</para>
+
+ <para>The Component Descriptor Editor consists of several tabbed pages; we will only
+ need to use a few of them here. For more information on using this editor, see <olink
+ targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>.</para>
+
+ <para>The initial page of the Component Descriptor Editor is the Overview page, which
+ appears as follows:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image008.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of Component Descriptor Editor overview page</phrase>
+ </textobject>
+ </mediaobject>
+</screenshot>
+
+ <para>This presents an overview of the RoomNumberAnnotator Analysis Engine (AE). The
+ left side of the page shows that this descriptor is for a
+ <emphasis>Primitive</emphasis> AE (meaning it consists of a single annotator),
+ and that the annotator code is developed in Java. Also, it specifies the Java class
+ that implements our logic (the code which was discussed in the previous section).
+ Finally, on the right side of the page are listed some descriptive attributes of our
+ annotator.</para>
+
+ <para>The other two pages that need to be filled out are the Type System page and the
+ Capabilities page. You can switch to these pages using the tabs at the bottom of the
+ Component Descriptor Editor. In the tutorial, these are already filled out for
+ you.</para>
+
+ <para>The RoomNumberAnnotator will be using the TutorialTypeSystem we looked at in
+ Section <xref linkend="ugr.tug.aae.defining_types"/>. To specify this, we add
+ this type system to the Analysis Engine's list of Imported Type Systems, using
+ the Type System page's right side panel, as shown here:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image010.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of CDE Type System page</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>On the Capabilities page, we define our annotator's inputs and outputs, in
+ terms of the types in the type system. The Capabilities page is shown below:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.3in" format="JPG" fileref="&imgroot;image012.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of CDE Capabilities page</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>Although capabilities come in sets, having multiple sets is deprecated; here
+ we're just using one set. The RoomNumberAnnotator is very simple. It requires
+ no input types, as it operates directly on the document text -- which is supplied as a
+ part of the CAS initialization (and which is always assumed to be present). It
+ produces only one output type (RoomNumber), and it sets the value of the
+ <literal>building</literal> feature on that type. This is all represented on the
+ Capabilities page.</para>
+
+ <para>The Capabilities page has two other parts for specifying languages and Sofas.
+ The languages section allows you to specify which languages your Analysis Engine
+ supports. The RoomNumberAnnotator happens to be language-independent, so we can
+ leave this blank. The Sofas section allows you to specify the names of additional
+ subjects of analysis. This capability and the Sofa Mappings at the bottom are
+ advanced topics, described in <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.aas"/>. </para>
+
+ <para>This is all of the information we need to provide for a simple annotator. If you
+ want to peek at the XML that this tool saves you from having to write, click on the
+ <quote>Source</quote> tab at the bottom to view the generated XML.</para>
+ </section>
+
+ <section id="ugr.tug.aae.testing_your_annotator">
+ <title>Testing Your Annotator</title>
+
+ <para>Having developed an annotator, we need a way to try it out on some example
+ documents. The UIMA SDK includes a tool called the Document Analyzer that will allow
+ us to do this. To run the Document Analyzer, execute the documentAnalyzer shell
+ script that is in the <literal>bin</literal> directory of your UIMA SDK
+ installation, or, if you are using the example Eclipse project, execute the
+ <quote>UIMA Document Analyzer</quote> run configuration supplied with that
+ project. (To do this, click on the menu bar Run → Run ... → and under Java
+ Applications in the left box, click on UIMA Document Analyzer.)</para>
+
+ <para>You should see a screen that looks like this:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image014.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of UIMA Document Analyzer GUI</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>There are six options on this screen:</para>
+
+ <orderedlist><listitem><para>Directory containing documents to analyze</para>
+ </listitem>
+
+ <listitem><para>Directory where analysis results will be written</para>
+ </listitem>
+
+ <listitem><para>The XML descriptor for the Analysis Engine (AE) you want to
+ run</para></listitem>
+
+ <listitem><para>(Optional) an XML tag, within the input documents, that contains
+ the text to be analyzed. For example, the value TEXT would cause the AE to only
+ analyze the portion of the document enclosed within
+ &lt;TEXT&gt;...&lt;/TEXT&gt; tags.</para></listitem>
+
+ <listitem><para>Language of the document </para></listitem>
+
+ <listitem><para>Character encoding </para></listitem></orderedlist>
+
+ <para>Use the Browse button next to the third item to set the <quote>Location of AE XML
+ Descriptor</quote> field to the descriptor we've just been discussing:
+ <literal>&lt;where-you-installed-uima-e.g.UIMA_HOME&gt;/examples/descriptors/tutorial/ex1/RoomNumberAnnotator.xml</literal>.
+ Set the other fields to the values shown in the screen shot above (which should be the
+ default values if this is the first time you've run the Document Analyzer). Then
+ click the <quote>Run</quote> button to start processing.</para>
+
+ <para>When processing completes, an <quote>Analysis Results</quote> window should
+ appear.</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="3.5in" format="JPG" fileref="&imgroot;image016.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of UIMA Document Analyzer Results GUI</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>Make sure <quote>Java Viewer</quote> is selected as the Results Display
+ Format, and <emphasis role="bold">double-click</emphasis> on the document
+ UIMASummerSchool2003.txt to view the annotations that were discovered. The view
+ should look something like this:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image018.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of UIMA CAS Annotation Viewer GUI</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>You can click the mouse on one of the highlighted annotations to see a list of all
+ its features in the frame on the right.</para> <note><para>The legend will only show
+ those types which have at least one instance in the CAS, and are declared as outputs in the
+ capabilities section of the descriptor (see <xref
+ linkend="ugr.tug.aae.creating_xml_descriptor"/>). </para></note>
+
+ <para>You can use the Document Analyzer to test any UIMA annotator;
+ just make sure that the annotator's classes are on the class
+ path.</para>
+ </section>
+ </section>
+
+ <section id="ugr.tug.aae.configuration_logging">
+ <title>Configuration and Logging</title>
+
+ <section id="ugr.tug.aae.configuration_parameters">
+ <title>Configuration Parameters</title>
+
+ <para>The example RoomNumberAnnotator from the previous section used hardcoded
+ regular expressions and location names, which is not very flexible. Supporting a
+ new room-numbering pattern would mean editing the annotator's Java code to add
+ another hardcoded regular expression. A better solution is to supply the patterns
+ through configuration parameters.</para>
+
+ <para>UIMA allows annotators to declare configuration parameters in their
+ descriptors. The descriptor also specifies default values for the parameters,
+ though these can be overridden at runtime.</para>
+
+ <section id="ugr.tug.aae.declaring_parameters_in_the_descriptor">
+ <title>Declaring Parameters in the Descriptor</title>
+
+ <para>The example descriptor
+ <literal>descriptors/tutorial/ex2/RoomNumberAnnotator.xml</literal> is
+ the same as the descriptor from the previous section except that information has
+ been filled in for the Parameters and Parameter Settings pages of the Component
+ Descriptor Editor.</para>
+
+ <para>First, in Eclipse, open example two's RoomNumberAnnotator in the
+ Component Descriptor Editor, and then go to the Parameters page (click on the
+ parameters tab at the bottom of the window), which is shown below:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image020.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of UIMA Component Descriptor Editor (CDE) Parameters page</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>Two parameters, Patterns and Locations, have been declared. In this
+ screen shot, the mouse (not shown) is hovering over Patterns to show its
+ description in the small popup window. Every parameter has the following
+ information associated with it:</para>
+
+ <itemizedlist><listitem><para>name – the name by which the annotator code
+ refers to the parameter</para></listitem>
+
+ <listitem><para>description – a natural language description of the
+ intent of the parameter</para></listitem>
+
+ <listitem><para>type – the data type of the parameter's value
+ – must be one of String, Integer, Float, or Boolean.</para></listitem>
+
+ <listitem><para>multiValued – true if the parameter can take
+ multiple values (an array), false if the parameter takes only a single value.
+ Shown above as <literal>Multi</literal>.</para></listitem>
+
+ <listitem><para>mandatory – true if a value must be provided for the
+ parameter. Shown above as <literal>Req</literal> (for required). </para>
+ </listitem></itemizedlist>
+
+ <para>Both of our parameters are mandatory and accept an array of Strings as their
+ value.</para>
+
+ <para>Next, default values are assigned to the parameters on the Parameter Settings
+ page:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image022.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of UIMA Component Descriptor Editor (CDE) Parameter Settings page</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>Here the <quote>Patterns</quote> parameter is selected, and the right pane
+ shows the list of values for this parameter, in this case the regular expressions
+ that match particular room numbering conventions. Notice the third pattern is
+ new, for matching the style of room numbers in the third building, which has room
+ numbers such as <literal>J2-A11</literal>.</para>
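+
+ <para>In the descriptor XML, the declaration of a parameter and its default values
+ live in two separate elements. The following sketch shows only the Patterns parameter.
+ The description text and the single regular expression are illustrative assumptions,
+ not the values shipped in the example descriptor:</para>
+
+ <programlisting><![CDATA[<configurationParameters>
+  <configurationParameter>
+    <name>Patterns</name>
+    <!-- illustrative description text -->
+    <description>Regular expressions that match room numbers</description>
+    <type>String</type>
+    <multiValued>true</multiValued>
+    <mandatory>true</mandatory>
+  </configurationParameter>
+</configurationParameters>
+
+<configurationParameterSettings>
+  <nameValuePair>
+    <name>Patterns</name>
+    <value>
+      <array>
+        <!-- hypothetical pattern for room numbers such as J2-A11 -->
+        <string>\bJ[12]-[A-Z]\d\d\b</string>
+      </array>
+    </value>
+  </nameValuePair>
+</configurationParameterSettings>]]></programlisting>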
+ </section>
+ <section id="ugr.tug.aae.accessing_parameter_values_from_annotator">
+ <title>Accessing Parameter Values from the Annotator Code</title>
+
+ <para>The class
+ <literal>org.apache.uima.tutorial.ex2.RoomNumberAnnotator</literal> has
+ overridden the initialize method. The initialize method is called by the UIMA
+ framework when the annotator is instantiated, so it is a good place to read
+ configuration parameter values. The default initialize method does nothing with
+ configuration parameters, so you have to override it. To see the code in Eclipse,
+ switch to the src folder, and open
+ <literal>org.apache.uima.tutorial.ex2</literal>. Here is the method
+ body:</para>
+
+
+ <programlisting>/**
+ * @see AnalysisComponent#initialize(UimaContext)
+ */
+public void initialize(UimaContext aContext)
+    throws ResourceInitializationException {
+  super.initialize(aContext);
+
+  // Get config. parameter values
+  String[] patternStrings =
+      (String[]) aContext.getConfigParameterValue("Patterns");
+  mLocations =
+      (String[]) aContext.getConfigParameterValue("Locations");
+
+  // compile regular expressions
+  mPatterns = new Pattern[patternStrings.length];
+  for (int i = 0; i &lt; patternStrings.length; i++) {
+    mPatterns[i] = Pattern.compile(patternStrings[i]);
+  }
+}</programlisting>
+
+ <para>Configuration parameter values are accessed through the UimaContext. As you
+ will see in subsequent sections of this chapter, the UimaContext is the
+ annotator's access point for all of the facilities provided by the UIMA
+ framework – for example logging and external resource access.</para>
+
+ <para>The UimaContext's <literal>getConfigParameterValue</literal>
+ method takes the name of the parameter as an argument; this must match one of the
+ parameters declared in the descriptor. The return value of this method is a Java
+ Object, whose type corresponds to the declared type of the parameter. It is up to the
+ annotator to cast it to the appropriate type, String[] in this case.</para>
+
+ <para>If there is a problem retrieving the parameter values, the framework throws an
+ exception. Generally annotators don't handle these, and just let them
+ propagate up.</para>
+
+ <para>To see the configuration parameters working, run the Document Analyzer
+ application and select the descriptor
+ <literal>examples/descriptors/tutorial/ex2/RoomNumberAnnotator.xml</literal>
+ . In the example document <literal>WatsonConferenceRooms.txt</literal>, you
+ should see some examples of Hawthorne II room numbers that would not have been
+ detected by the ex1 version of RoomNumberAnnotator.</para>
+ </section>
+
+ <section id="ugr.tug.aae.supporting_reconfiguration">
+ <title>Supporting Reconfiguration</title>
+
+ <para>If you take a look at the Javadocs (located in the <ulink
+ url="api/index.html">docs/api</ulink> directory) for
+ <literal>org.apache.uima.analysis_component.AnalysisComponent</literal>
+ (which our annotator implements indirectly through JCasAnnotator_ImplBase),
+ you will see that there is a reconfigure() method, which is called by the containing
+ application through the UIMA framework, if the configuration parameter values
+ are changed.</para>
+
+ <para>The AnalysisComponent_ImplBase class provides a default implementation
+ that just calls the annotator's destroy method followed by its initialize
+ method. This works fine for our annotator. The only situation in which you might
+ want to override the default reconfigure() is if your annotator has very expensive
+ initialization logic, and you don't want to reinitialize everything if just
+ one configuration parameter has changed. In that case, you can provide a more
+ intelligent implementation of reconfigure() for your annotator.</para>
+
+ </section>
+
+ <section id="ugr.tug.aae.configuration_parameter_groups">
+ <title>Configuration Parameter Groups</title>
+
+ <para>For annotators with many sets of configuration parameters, UIMA supports
+ organizing them into groups. It is possible to define a parameter with the same name
+ in multiple groups; one common use for this is for annotators that can process
+ documents in several languages and which want to have different parameter
+ settings for the different languages.</para>
+
+ <para>The syntax for defining parameter groups in your descriptor is fairly
+ straightforward – see <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor"/> for details. Values of
+ parameters defined within groups are accessed through the two-argument version
+ of <literal>UimaContext.getConfigParameterValue</literal>, which takes
+ both the group name and the parameter name as its arguments.</para>
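+
+ <para>As a rough illustration, a descriptor that keeps separate Patterns settings per
+ language could look like the sketch below. The group names
+ (<literal>en</literal>, <literal>de</literal>), the search strategy, and the pattern
+ values are all hypothetical; consult the reference linked above for the authoritative
+ syntax. With such a descriptor, a call like
+ <literal>aContext.getConfigParameterValue("de", "Patterns")</literal> would return
+ the values set for the <literal>de</literal> group.</para>
+
+ <programlisting><![CDATA[<configurationParameters defaultGroup="en"
+    searchStrategy="language_fallback">
+  <!-- one declaration shared by both (hypothetical) groups -->
+  <configurationGroup names="en de">
+    <configurationParameter>
+      <name>Patterns</name>
+      <type>String</type>
+      <multiValued>true</multiValued>
+      <mandatory>false</mandatory>
+    </configurationParameter>
+  </configurationGroup>
+</configurationParameters>
+
+<configurationParameterSettings>
+  <settingsForGroup name="en">
+    <nameValuePair>
+      <name>Patterns</name>
+      <value><array><string>Room \d+</string></array></value>
+    </nameValuePair>
+  </settingsForGroup>
+  <settingsForGroup name="de">
+    <nameValuePair>
+      <name>Patterns</name>
+      <value><array><string>Raum \d+</string></array></value>
+    </nameValuePair>
+  </settingsForGroup>
+</configurationParameterSettings>]]></programlisting>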
+ </section>
+ </section>
+
+ <section id="ugr.tug.aae.logging">
+ <title>Logging</title>
+
+ <para>The UIMA SDK provides a logging facility, which is very similar to the
+ java.util.logging.Logger class that was introduced in Java 1.4.</para>
+
+ <para>In the Java architecture, each logger instance is associated with a name. By
+ convention, this name is often the fully qualified class name of the component
+ issuing the logging call. The name can be referenced in a configuration file when
+ specifying which kinds of log messages to actually log, and where they should
+ go.</para>
+
+ <para>The UIMA framework supports this convention using the
+ <literal>UimaContext</literal> object. If you access a logger instance using
+ <literal>getContext().getLogger()</literal> within an Annotator, the logger
+ name will be the fully qualified name of the Annotator implementation class.</para>
+
+ <para>Here is an example from the process method of
+ <literal>org.apache.uima.tutorial.ex2.RoomNumberAnnotator</literal>:
+
+
+ <programlisting>getContext().getLogger().log(Level.FINEST,"Found: " + annotation);</programlisting>
+ </para>
+
+ <para>The first argument to the log method is the level of the log output. Here, a value of
+ FINEST indicates that this is a highly-detailed tracing message. While useful for
+ debugging, it is likely that real applications will not output log messages at this
+ level, in order to improve their performance. Other defined levels, from lowest to
+ highest importance, are FINER, FINE, CONFIG, INFO, WARNING, and SEVERE.</para>
+
+ <para>If no logging configuration file is provided (see next section), the Java
+ Virtual Machine defaults are used, which typically log messages at level INFO and
+ higher, and direct output to the console.</para>
+
+ <para>If you specify the standard UIMA SDK <literal>Logger.properties</literal>,
+ the output will be directed to a file named uima.log, in the current working directory
+ (often the <quote>project</quote> directory when running from Eclipse, for
+ instance).</para> <note><para>When using Eclipse, a uima.log file written
+ into the Eclipse workspace (in the project uimaj-examples, for example) may not appear
+ in the Eclipse package explorer view until you right-click the uimaj-examples project
+ and select <quote>Refresh</quote>. This operation refreshes the
+ Eclipse display to conform to what may have changed on the file system. Also, you can set
+ the Eclipse preferences for the workspace to automatically refresh (Window →
+ Preferences → General → Workspace, then click the <quote>refresh
+ automatically</quote> checkbox).</para></note>
+
+ <section id="ugr.tug.aae.logging.configuring">
+ <title>Specifying the Logging Configuration</title>
+
+ <para>The standard UIMA logger uses the underlying Java 1.4 logging mechanism. You
+ can use the APIs that come with that to configure the logging. In addition, the
+ standard Java 1.4 logging initialization mechanisms will look for a Java System
+ Property named <literal>java.util.logging.config.file</literal> and if
+ found, will use the value of this property as the name of a standard
+ <quote>properties</quote> file, for setting the logging level. Please refer to
+ the Java 1.4 documentation for more information on the format and use of this
+ file.</para>
+
+ <para>Two sample logging specification property files can be found in the UIMA_HOME
+ directory where the UIMA SDK is installed:
+ <literal>config/Logger.properties</literal>, and
+ <literal>config/FileConsoleLogger.properties</literal>. These specify the same
+ logging, except the first logs just to a file, while the second logs both to a file and
+ to the console. You can edit these files, or create additional ones, as described
+ below, to change the logging behavior.</para>
+
+ <para>When running your own Java application, you can specify the location of the
+ logging configuration file on your Java command line by setting the Java system
+ property <literal>java.util.logging.config.file</literal> to be the logging
+ configuration filename. This file specification can be either absolute or
+ relative to the working directory. For example:
+
+
+ <programlisting><?db-font-size 65% ?>java "-Djava.util.logging.config.file=C:/Program Files/apache-uima/config/Logger.properties"</programlisting>
+ <note><para>In a shell script, you can use environment variables such as
+ UIMA_HOME if convenient.</para></note> </para>
+
+ <para>If you are using Eclipse to launch your application, you can set this property
+ in the VM arguments section of the Arguments tab of the run configuration screen. If
+ you've set an environment variable UIMA_HOME, you could for example, use the
+ string:
+ <literal>"-Djava.util.logging.config.file=${env_var:UIMA_HOME}/config/Logger.properties".</literal>
+ </para>
+
+ <para>If you are running the .bat or .sh files in the UIMA SDK's <literal>bin</literal> directory, you can specify the location of your
+ logger configuration file by setting the <literal>UIMA_LOGGER_CONFIG_FILE</literal> environment variable prior to running the script,
+ for example (on Windows):
+
+ <programlisting><?db-font-size 70% ?>set UIMA_LOGGER_CONFIG_FILE=C:/myapp/MyLogger.properties</programlisting>
+ </para>
+ </section>
+
+ <section id="ugr.tug.aae.logging.setting_logging_levels">
+ <title>Setting Logging Levels</title>
+
+ <para>Within the logging control file, the default global logging level specifies
+ which kinds of events are logged across all loggers, for example: <literal>.level=
+ INFO</literal>. For any given facility this
+ global level can be overridden by a facility-specific level. Multiple handlers are
+ supported, which allows messages to be directed to a log file as well as to a
+ <quote>console</quote>. Note that the ConsoleHandler also has a separate level
+ setting to limit messages printed to the console.</para>
+
+ <para>The properties file can change where the log is written, as well.</para>
+
+ <para>Facility-specific properties allow different logging for each class, as
+ well. For example, to set the com.xyz.foo logger to only log SEVERE messages:
+ <literal>com.xyz.foo.level = SEVERE</literal></para>
+
+ <para>If you have a sample annotator whose class is
+ <literal>org.apache.uima.SampleAnnotator</literal>, you can set its log level
+ by specifying: <literal>org.apache.uima.SampleAnnotator.level =
+ ALL</literal></para>
+
+ <para>There are other logging controls; for a full discussion, please read the
+ contents of the <literal>Logger.properties</literal> file and the Java
+ specification for logging in Java 1.4.</para>
+ </section>
+
+ <section id="ugr.tug.aae.logging.output_format">
+ <title>Format of logging output</title>
+
+ <para>The logging output is formatted by handlers specified in the properties file
+ for configuring logging, described above. The default formatter that comes with
+ the UIMA SDK formats logging output as follows:</para>
+
+ <para><literal>Timestamp - threadID: sourceInfo: Message level:
+ message</literal></para>
+
+ <para> Here's an example:</para>
+
+ <para><literal>7/12/04 2:15:35 PM - 10:
+ org.apache.uima.util.TestClass.main(62): INFO: You are not logged
+ in!</literal></para>
+ </section>
+
+ <section id="ugr.tug.aae.logging.meaning_of_severity_levels">
+ <title>Meaning of the logging severity levels</title>
+
+ <para>These levels are defined by the Java logging framework, which was
+ incorporated into Java as of the 1.4 release level. The levels are defined in the
+ Javadocs for java.util.logging.Level, and include both logging and tracing
+ levels:
+ <itemizedlist spacing="compact">
+ <listitem><para>OFF is a special level that can be used to turn off
+ logging.</para></listitem>
+
+ <listitem><para>ALL indicates that all messages should be logged. </para>
+ </listitem>
+
+ <listitem><para>CONFIG is a message level for configuration messages. These
+ would typically occur once (during configuration) in methods like
+ <literal>initialize()</literal>. </para></listitem>
+
+ <listitem><para>INFO is a message level for informational messages, for
+ example, connected to server IP: 192.168.120.12 </para></listitem>
+
+ <listitem><para>WARNING is a message level indicating a potential
+ problem.</para></listitem>
+
+ <listitem><para>SEVERE is a message level indicating a serious
+ failure.</para></listitem>
+ </itemizedlist></para>
+
+ <para> Tracing levels, typically used for debugging:
+ <itemizedlist>
+
+ <listitem><para>FINE is a message level providing tracing information,
+ typically at a collection level (messages occurring once per collection).
+ </para></listitem>
+
+ <listitem><para>FINER indicates a fairly detailed tracing message,
+ typically at a document level (once per document).</para></listitem>
+
+ <listitem><para>FINEST indicates a highly detailed tracing message. </para>
+ </listitem></itemizedlist></para>
+ </section>
+
+ <section id="ugr.tug.aae.logging.using_outside_of_an_annotator">
+ <title>Using the logger outside of an annotator</title>
+
+ <para>An application using UIMA may want to log its messages using the same logging
+ framework. This can be done by getting a reference to the UIMA logger, as follows:
+
+
+ <programlisting>Logger logger = UIMAFramework.getLogger(TestClass.class);</programlisting>
+ </para>
+
+ <para>The optional class argument allows filtering by class (if the log handler
+ supports this). If not specified, the name of the returned logger instance is
+ <quote>org.apache.uima</quote>.</para>
+ </section>
+
+ <section id="ugr.tug.aae.logging.change_logger_implementation">
+ <title>Changing the underlying UIMA logging implementation</title>
+
+ <para>By default, the UIMA framework uses the Java logging framework under the hood of the UIMA Logger
+ interface. It is possible to switch the logging implementation from Java logging to
+ an arbitrary logging system by specifying the system property
+ <programlisting>-Dorg.apache.uima.logger.class=&lt;loggerClass&gt;</programlisting>
+ when the UIMA framework is started.
+ </para>
+ <para>
+ The specified logger class must be available in the classpath and must implement the
+ <code>org.apache.uima.util.Logger</code> interface.
+ </para>
+
+ <para>
+ UIMA also provides a logging implementation that uses Apache Log4j instead of Java logging. To
+ use Log4j, you have to provide the Log4j jars in the classpath, and your application
+ must specify the logging configuration as shown below.
+ <programlisting><?db-font-size 80% ?>-Dorg.apache.uima.logger.class=org.apache.uima.util.impl.Log4jLogger_impl</programlisting>
+ </para>
+ </section>
+
+
+ </section>
+ </section>
+ <section id="ugr.tug.aae.building_aggregates">
+ <title>Building Aggregate Analysis Engines</title>
+
+ <section id="ugr.tug.aae.combining_annotators">
+ <title>Combining Annotators</title>
+
+ <para>The UIMA SDK makes it very easy to combine any sequence of Analysis Engines to
+ form an <emphasis>Aggregate Analysis Engine</emphasis>. This is done through an
+ XML descriptor; no Java code is required!</para>
+
+ <para>If you go to the <literal>examples/descriptors/tutorial/ex3</literal>
+ folder (in Eclipse, it's in your uimaj-examples project, under the
+ <literal>descriptors/tutorial/ex3</literal> folder), you will find a
+ descriptor for a TutorialDateTime annotator. This annotator detects dates and
+ times (and also sentences and words). To see what this annotator can do, try it out
+ using the Document Analyzer. If you are curious as to how this annotator works, the
+ source code is included, but it is not necessary to understand the code at this
+ time.</para>
+
+ <para>We are going to combine the TutorialDateTime annotator with the
+ RoomNumberAnnotator to create an aggregate Analysis Engine. This is illustrated
+ in the following figure:
+
+ <figure id="ugr.tug.aae.fig.combining_annotators">
+ <title>Combining Annotators to form an Aggregate Analysis Engine</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="PNG"
+ fileref="&imgroot;image024.png"/>
+ </imageobject>
+ <textobject> <phrase>Combining Annotators to form an Aggregate Analysis
+ Engine</phrase>
+ </textobject>
+ </mediaobject>
+ </figure> </para>
+
+ <para>The descriptor that does this is named
+ <literal>RoomNumberAndDateTime.xml</literal>, which you can open in the
+ Component Descriptor Editor plug-in. This is in the uimaj-examples project in the
+ folder <literal>descriptors/tutorial/ex3</literal>. </para>
+
+ <para>The <quote>Aggregate</quote> page of the Component Descriptor Editor is
+ used to define which components make up the aggregate. A screen shot is shown below.
+ (If you are not using Eclipse, see <xref
+ linkend="ugr.tug.aae.xml_intro_ae_descriptor"/> for the actual XML syntax
+ for Aggregate Analysis Engine Descriptors.)</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image026.jpg"/>
+ </imageobject>
+ <textobject>
+ <phrase>Aggregate page of the Component Descriptor Editor (CDE)</phrase>
+ </textobject>
+ </mediaobject>
+</screenshot>
+
+ <para>On the left side of the screen is the list of component engines that make up the
+ aggregate – in this case, the TutorialDateTime annotator and the
+ RoomNumberAnnotator. To add a component, you can click the <quote>Add</quote>
+ button and browse to its descriptor. You can also click the <quote>Find AE</quote>
+ button and search for an Analysis Engine in your Eclipse workspace.
+ <note><para>The <quote>AddRemote</quote> button is used for adding components
+ which run remotely (for example, on another machine using a remote networking
+ connection). This capability is described in section <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.application.how_to_call_a_uima_service"/>.</para>
+ </note> </para>
+
+ <para>The order of the components in the left pane does not imply an order of
+ execution. The order of execution, or <quote>flow</quote>, is determined in the
+ <quote>Component Engine Flow</quote> section on the right. UIMA supports
+ different types of algorithms (including user-definable) for determining the
+ flow. Here we pick the simplest: <literal>FixedFlow</literal>. We have chosen to
+ have the RoomNumberAnnotator execute first, although in this case it
+ doesn't really matter, since the RoomNumber and DateTime annotators do not
+ have any dependencies on one another.</para>
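+
+ <para>In XML terms, the component list and the fixed flow described above map onto two
+ descriptor elements. The sketch below is illustrative only: the delegate keys and the
+ import locations are hypothetical and need not match the shipped
+ <literal>RoomNumberAndDateTime.xml</literal>. The order of the
+ <literal>node</literal> elements, not the order of the delegates, determines the order
+ of execution:</para>
+
+ <programlisting><![CDATA[<delegateAnalysisEngineSpecifiers>
+  <!-- keys and import locations are hypothetical -->
+  <delegateAnalysisEngine key="DateTime">
+    <import location="TutorialDateTime.xml"/>
+  </delegateAnalysisEngine>
+  <delegateAnalysisEngine key="RoomNumber">
+    <import location="../ex2/RoomNumberAnnotator.xml"/>
+  </delegateAnalysisEngine>
+</delegateAnalysisEngineSpecifiers>
+
+<!-- inside analysisEngineMetaData: RoomNumber runs first -->
+<flowConstraints>
+  <fixedFlow>
+    <node>RoomNumber</node>
+    <node>DateTime</node>
+  </fixedFlow>
+</flowConstraints>]]></programlisting>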
+
+ <para>If you look at the <quote>Type System</quote> page of the Component
+ Descriptor Editor, you will see that it displays the type system but is not
+ editable. The Type System of an Aggregate Analysis Engine is automatically
+ computed by merging the Type Systems of all of its components.</para>
+
+ <warning><para>If the components have different definitions for the same type name,
+ the Component Descriptor Editor will show a warning. It is possible to continue past
+ this warning, in which case your aggregate's type system will have the correct
+ <quote>merged</quote>
+ type definition that contains all of the features defined on that type by all of your
+ components. However, it is not recommended to use this feature in conjunction with JCas,
+ since the JCas Java Class definitions cannot be so easily merged. See
+ <olink
+ targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.jcas.merging_types_from_other_specs"/> for more information.
+ </para></warning>
+
+ <para>The Capabilities page is where you explicitly declare the aggregate Analysis
+ Engine's inputs and outputs. Sofas and Languages are described later.
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image028.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screen shot of the Capabilities page of the Component Descriptor Editor
+ </phrase></textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+ <para>Note that it is not automatically assumed that all outputs of each component
+ Analysis Engine (AE) are passed through as outputs of the aggregate AE. In this
+ case, for example, we have decided to suppress the Word and Sentence annotations
+ that are produced by the TutorialDateTime annotator.</para>
+
+ <para>You can run this AE using the Document Analyzer in the same way that you run any
+ other AE. Just select the <literal>examples/descriptors/tutorial/ex3/
+ RoomNumberAndDateTime.xml</literal> descriptor and click the Run button. You
+ should see that RoomNumbers, Dates, and Times are all shown but that Words and
+ Sentences are not:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image030.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screen shot results of running the Document Analyzer
+ </phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ </section>
+
+ <section id="ugr.tug.aae.aaes_can_contain_cas_consumers">
+ <title>AAEs can also contain CAS Consumers</title>
+
+ <para>In addition to aggregating Analysis Engines, Aggregates can also contain CAS
+ Consumers (see <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.cpe"/>), or even a mixture of these components with regular
+ Analysis Engines. The UIMA examples include an Aggregate which contains
+ both an analysis engine and a CAS consumer, in
+ <literal>examples/descriptors/MixedAggregate.xml</literal>.</para>
+
+ <para>Analysis Engines support the <literal>collectionProcessComplete</literal>
+ method, which is particularly important for many CAS Consumers. If
+ an application (or a Collection Processing Engine) calls
+ <literal>collectionProcessComplete</literal> on an aggregate, the framework
+ will deliver that call to all of the components of the aggregate. If you use
+ one of the built-in flow types (fixedFlow or capabilityLanguageFlow), then the
+ order specified in that flow will be the same order in which the
+ <literal>collectionProcessComplete</literal> calls are made to the components.
+ If a custom flow is used, then the calls will be made in arbitrary order.
+ </para>
+ </section>
+
+ <section id="ugr.tug.aae.reading_results_previous_annotators">
+ <title>Reading the Results of Previous Annotators</title>
+
+ <para>So far, we have been looking at annotators that look directly at the document text. However, annotators
+ can also use the results of other annotators. One useful thing we can do at this point is look for the
+ co-occurrence of a Date, a RoomNumber, and two Times – and annotate that as a Meeting.</para>
+
+ <para>The CAS maintains <emphasis>indexes</emphasis> of annotations, and from an index you can obtain an
+ iterator that allows you to step through all annotations of a particular type. Here's some example code
+ that would iterate over all of the TimeAnnot annotations in the JCas:
+
+
+ <programlisting>FSIndex timeIndex = aJCas.getAnnotationIndex(TimeAnnot.type);
+Iterator timeIter = timeIndex.iterator();
+while (timeIter.hasNext()) {
+  TimeAnnot time = (TimeAnnot)timeIter.next();
+
+  //do something
+}</programlisting></para>
+
+ <note>
+ <para>You can also use the method
+ <literal>aJCas.getJFSIndexRepository().getAllIndexedFS(YourClass.type)</literal>, which returns an iterator
+ over all instances of <literal>YourClass</literal> in no particular order. This can be useful for types
+ that are not subtypes of the built-in Annotation type and which therefore have no default sort order.</para>
+
+ <para>Also, if you've defined your own custom index as described in <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor.aes.index"/>, you can get an iterator over that
+ specific index by calling <literal>aJCas.getJFSIndexRepository().getIndex(label)</literal>.
+ The <literal>getIndex(...)</literal> method also has a two-argument form; the second argument,
+ if used, specializes the index to a subtype of the type the index was declared to index. For instance,
+ if you defined an index called "allEvents" over the type <literal>Event</literal>, and wanted
+ to get an index over just a particular subtype of event, say, <literal>TimeEvent</literal>,
+ you could ask for that index using
+ <literal>aJCas.getJFSIndexRepository().getIndex("allEvents", TimeEvent.type)</literal>.</para></note>
+
+ <para>Now that we've explained the basics, let's take a look at the process method for
+ <literal>org.apache.uima.tutorial.ex4.MeetingAnnotator</literal>. Since we're looking for a
+ combination of a RoomNumber, a Date, and two Times, there are four nested iterators. (There's surely a
+ better algorithm for doing this, but to keep things simple we're just going to look at every combination
+ of the four items.)</para>
+
+ <para>For each combination of the four annotations, we compute the span of text that includes all of them, and
+ then we check to see if that span is smaller than a <quote>window</quote> size, a configuration parameter.
+ There are also some checks to make sure that we don't annotate the same span of text multiple times. If all
+ the checks pass, we create a Meeting annotation over the whole span. There's really nothing to
+ it!</para>
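+
+ <para>The search described above can be sketched in plain Java, independent of the UIMA API. This is
+ only an illustration under stated assumptions, not the actual <literal>MeetingAnnotator</literal> source:
+ the <literal>MeetingSketch</literal> and <literal>Span</literal> classes are hypothetical stand-ins for the
+ annotator and its annotation types, and recording each span in a set is just one possible way to avoid
+ annotating the same span of text more than once.</para>
+
+ <programlisting><![CDATA[import java.util.LinkedHashSet;
+import java.util.Set;
+
+public class MeetingSketch {
+
+  /** Hypothetical stand-in for an annotation: a character span [begin, end). */
+  public static final class Span {
+    final int begin;
+    final int end;
+
+    public Span(int begin, int end) {
+      this.begin = begin;
+      this.end = end;
+    }
+  }
+
+  /** Tries every room/date/time/time combination; returns the distinct "begin-end" spans. */
+  public static Set<String> findMeetings(Span[] rooms, Span[] dates, Span[] times, int window) {
+    Set<String> meetings = new LinkedHashSet<>();
+    for (Span room : rooms) {
+      for (Span date : dates) {
+        for (Span t1 : times) {
+          for (Span t2 : times) {
+            if (t1.begin >= t2.begin) {
+              continue; // need two different times; look at each pair only once
+            }
+            // Smallest span of text that covers all four annotations.
+            int begin = Math.min(Math.min(room.begin, date.begin), Math.min(t1.begin, t2.begin));
+            int end = Math.max(Math.max(room.end, date.end), Math.max(t1.end, t2.end));
+            if (end - begin < window) {
+              // A set ignores a span that has already been recorded.
+              meetings.add(begin + "-" + end);
+            }
+          }
+        }
+      }
+    }
+    return meetings;
+  }
+}]]></programlisting>
+
+ <para>In a real annotator, the arrays would be replaced by iterators obtained from the annotation
+ indexes, and each recorded span would become a Meeting annotation added to the CAS.</para>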
+
+ <para>The XML descriptor, located in
+ <literal>examples/descriptors/tutorial/ex4/MeetingAnnotator.xml</literal>, is also very
+ straightforward. An important difference from previous descriptors is that this is the first annotator
+ we've discussed that has input requirements. This can be seen on the <quote>Capabilities</quote>
+ page of the Component Descriptor Editor:</para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image032.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screen shot of Capabilities page of the Component Descriptor Editor
+ </phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>If we were to run the MeetingAnnotator on its own, it wouldn't detect anything because it
+ wouldn't have any input annotations to work with. The required input annotations can be produced by the
+ RoomNumber and DateTime annotators. So, we create an aggregate Analysis Engine containing these two
+ annotators, followed by the Meeting annotator. This aggregate is illustrated in <xref
+ linkend="ugr.tug.aae.fig.aggregate_for_meeting_annotator"/>. The descriptor for this is in
+ <literal>examples/descriptors/tutorial/ex4/MeetingDetectorAE.xml</literal>. Give it a try in the
+ Document Analyzer.
+
+ <figure id="ugr.tug.aae.fig.aggregate_for_meeting_annotator">
+ <title>An Aggregate Analysis Engine where an internal component uses output from previous
+ engines</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="PNG" fileref="&imgroot;image034.png"/>
+ </imageobject>
+ <textobject><phrase>An Aggregate Analysis Engine where an internal component uses output from
+ previous engines. </phrase>
+ </textobject>
+ </mediaobject>
+ </figure> </para>
+
+ </section>
+ </section>
+
+ <section id="ugr.tug.aae.other_examples">
+ <title>Other examples</title>
+
+ <para>The UIMA SDK includes several other examples you may find interesting,
+ including</para>
+
+ <itemizedlist spacing="compact">
+ <listitem><para>SimpleTokenAndSentenceAnnotator – a simple tokenizer and
+ sentence annotator.</para></listitem>
+
+ <listitem><para>XmlDetagger – a multi-sofa annotator that does XML
+ detagging. Multiple Sofas (Subjects of Analysis) are described in a later chapter;
+ see <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.mvs"/>. It reads XML data from the input Sofa
+ (named "xmlDocument"); this data can be stored in the CAS as a string or array, or it can
+ be a URI to a remote file. The XML is parsed using the JVM's default parser, and the
+ plain-text content is written to a new sofa called "plainTextDocument".</para>
+ </listitem>
+
+ <listitem><para>PersonTitleDBWriterCasConsumer – a sample CAS Consumer
+ which populates a relational database with some annotations. It uses JDBC and, in this
+ example, connects to the open-source Apache Derby database.</para></listitem>
+ </itemizedlist>
+ </section>
+
+ <section id="ugr.tug.aae.additional_topics">
+ <title>Additional Topics</title>
+
+ <section id="ugr.tug.aae.contract_for_annotator_methods">
+ <title>Contract: Annotator Methods Called by the Framework</title>
+ <titleabbrev>Annotator Methods</titleabbrev>
+
+ <para>The UIMA framework ensures that an Annotator instance is called by only one
+ thread at a time. An instance never has to worry about running some method on one
+ thread, and then asynchronously being called using another thread. This approach
+ simplifies the design of annotators – they do not have to be designed to support
+ multi-threading. When multi-threading is wanted for performance, multiple
+ instances of the Annotator are created, each one running on just one thread.</para>
+
+ <para>The following table defines the methods called by the framework, when they are
+ called, and the requirements annotator implementations must follow.</para>
+
+ <informaltable frame="all">
+ <tgroup cols="3" colsep="1" rowsep="1">
+ <colspec colname="c1" colwidth="1*"/>
+ <colspec colname="c2" colwidth="2*"/>
+ <colspec colname="c3" colwidth="2*"/>
+ <thead>
+ <row>
+ <entry align="center">Method</entry>
+ <entry align="center">When Called by Framework</entry>
+ <entry align="center">Requirements</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>initialize</entry>
+ <entry>Typically only called once, when the instance is created. Can be called
+ again if the application makes a reconfigure call and the default behavior
+ isn't overridden (the default behavior for reconfigure is to call
+ <literal>destroy</literal> followed by
+ <literal>initialize</literal>).</entry>
+ <entry>Normally does one-time initialization, including reading of
+ configuration parameters. If the application changes the parameters, it
+ can call initialize to have the annotator re-do its
+ initialization.</entry>
+ </row>
+ <row>
+ <entry>typeSystemInit</entry>
+ <entry>Called before <literal>process</literal> whenever the type system
+ in the CAS being passed in differs from what was previously passed in a
+ <literal>process</literal> call (and called for the first CAS passed in,
+ too). The Type System being passed to an annotator only changes in the case of
+ remote annotators that are active as servers, receiving possibly
+ different type systems to operate on.</entry>
+ <entry>Typically, users of JCas do not implement any method for this. An
+ annotator can use this call to read the CAS type system and set up any instance
+ variables that make accessing the types and features convenient.</entry>
+ </row>
+ <row>
+ <entry>process</entry>
+ <entry>Called once for each CAS. Called by the application if not using the
+ Collection Processing Manager (CPM); the application calls the process
+ method on the analysis engine, which is then delegated by the framework to
+ all the annotators in the engine. For a Collection Processing application,
+ the CPM calls the process method. If the application creates and manages
+ its own Collection Processing Engine via API calls (see Javadocs), the
+ application calls this on the Collection Processing Engine, and it is
+ delegated by the framework to the components.</entry>
+ <entry>Process the CAS, adding and/or modifying elements in it</entry>
+ </row>
+ <row>
+ <entry>destroy</entry>
+ <entry>This method can be called by applications, and is also called by the
+ Collection Processing Manager framework when the collection processing
+ completes. It is also called on those delegate components of an Aggregate that
+ successfully completed their <literal>initialize</literal> call, when
+ a subsequent delegate (or flow controller) in the aggregate fails to initialize.
+ This allows components which need to clean up things done during initialization
+ to do so. It is up to the component writer to use a try/finally construct during initialization
+ to clean up after errors that occur during initialization within one component.
+ The <literal>destroy</literal> call on an aggregate is
+ propagated to all contained analysis engines.</entry>
+ <entry>An annotator should release all resources, close files, close
+ database connections, etc., and return to a state where another initialize
+ call could be received to restart. Typically, after a destroy call, no
+ further calls will be made to an annotator instance.</entry>
+ </row>
+ <row>
+ <entry>reconfigure</entry>
+ <entry><para>This method is never called by the framework, unless an
+ application calls it on the Engine object – in which case the
+ framework propagates it to all annotators contained in the Engine.</para>
+ <para>Its purpose is to signal that the configuration parameters have
+ changed.</para></entry>
+ <entry>A default implementation of this calls destroy, followed by
+ initialize. This is the only case where initialize would be called more than
+ once. Users should implement whatever logic is needed to return the
+ annotator to an initialized state, including re-reading the
+ configuration parameter data.</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>
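+
+ <para>The try/finally advice for <literal>initialize</literal> can be sketched as follows. This is a
+ minimal plain-Java illustration, not UIMA code: <literal>InitCleanupSketch</literal>, its
+ <literal>Resource</literal> class, and the <literal>failOnConnect</literal> flag are hypothetical, and
+ a real annotator would report the failure with a <literal>ResourceInitializationException</literal>
+ rather than an <literal>IllegalStateException</literal>.</para>
+
+ <programlisting>public class InitCleanupSketch {
+
+  /** Hypothetical resource that records whether it has been closed. */
+  public static final class Resource implements AutoCloseable {
+    public boolean closed = false;
+
+    @Override
+    public void close() {
+      closed = true;
+    }
+  }
+
+  private Resource dictionary;
+  private Resource connection;
+
+  /** Stand-in for initialize(): on failure, releases whatever was already acquired. */
+  public void initialize(Resource dict, boolean failOnConnect) {
+    boolean initialized = false;
+    dictionary = dict; // first resource acquired
+    try {
+      if (failOnConnect) {
+        throw new IllegalStateException("could not open connection");
+      }
+      connection = new Resource(); // second resource acquired
+      initialized = true;
+    } finally {
+      if (!initialized) {
+        destroy(); // undo the partial initialization before the exception propagates
+      }
+    }
+  }
+
+  /** Stand-in for destroy(): releases everything so that initialize() could be called again. */
+  public void destroy() {
+    if (connection != null) {
+      connection.close();
+      connection = null;
+    }
+    if (dictionary != null) {
+      dictionary.close();
+      dictionary = null;
+    }
+  }
+}</programlisting>
+
+ <para>Keeping all of the release logic in <literal>destroy</literal> means the same code path serves
+ both a failed initialization and a normal shutdown.</para>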
+
+ </section>
+
+ <section id="ugr.tug.aae.reporting_errors_from_annotators">
+ <title>Reporting errors from Annotators</title>
+
+ <para>There are two broad classes of errors that can occur: recoverable and
+ unrecoverable. Because Annotators are often expected to process very large numbers
+ of artifacts (for example, text documents), they should be written to recover where
+ possible.</para>
+
+ <para>For example, if an upstream annotator created invalid input for an annotator,
+ that annotator may want to log this event, ignore the bad input, and
+ continue. It may include a notification of this event in the CAS, for further
+ downstream annotators to consider. Or, it may throw an exception (see next section)
+ – but in this case, it cannot do any further processing on that
+ document.</para> <note><para>The choice of what to do can be made configurable,
+ using the configuration parameters. </para></note>
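+
+ <para>One way to make that choice configurable is sketched below in plain Java. The names are
+ hypothetical: <literal>ErrorPolicySketch</literal> stands in for an annotator, the
+ <literal>OnBadInput</literal> enum for the value of a configuration parameter, and integer parsing for
+ whatever validation the annotator performs on its input. A real annotator would throw an
+ <literal>AnalysisEngineProcessException</literal> where this sketch throws an
+ <literal>IllegalStateException</literal>.</para>
+
+ <programlisting><![CDATA[import java.util.ArrayList;
+import java.util.List;
+import java.util.logging.Logger;
+
+public class ErrorPolicySketch {
+
+  /** Hypothetical values of a configuration parameter that selects the behavior. */
+  public enum OnBadInput { SKIP, FAIL }
+
+  private static final Logger LOG = Logger.getLogger(ErrorPolicySketch.class.getName());
+
+  /** Parses upstream values, either skipping invalid ones or giving up on the document. */
+  public static List<Integer> parseAll(List<String> inputs, OnBadInput policy) {
+    List<Integer> parsed = new ArrayList<>();
+    for (String input : inputs) {
+      try {
+        parsed.add(Integer.parseInt(input));
+      } catch (NumberFormatException e) {
+        if (policy == OnBadInput.FAIL) {
+          // Treated as unrecoverable: no further processing of this document.
+          throw new IllegalStateException("Invalid upstream input: " + input, e);
+        }
+        // Treated as recoverable: log the event, ignore the bad input, and continue.
+        LOG.warning("Ignoring invalid upstream input: " + input);
+      }
+    }
+    return parsed;
+  }
+}]]></programlisting>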
+
+ </section>
+
+ <section id="ugr.tug.aae.throwing_exceptions_from_annotators">
+ <title>Throwing Exceptions from Annotators</title>
+
+ <para>Let's say an invalid regular expression was passed as a parameter to the
+ RoomNumberAnnotator. Because this is an error related to the overall
+ configuration, and not something we could expect to ignore, we should throw an
+ appropriate exception, and most Java programmers would expect to do so like
+ this:</para>
+
+
+ <programlisting>throw new ResourceInitializationException(
+ "The regular expression " + x + " is not valid.");</programlisting>
+
+ <para>UIMA, however, does not do it this way. All UIMA exceptions are
+ <emphasis>internationalized</emphasis>, meaning that they support translation
+ into other languages. This is accomplished by eliminating hardcoded message
+ strings and instead using external message digests. Message digests are files
+ containing (key, value) pairs. The key is used in the Java code instead of the actual
+ message string. This allows the message string to be easily translated later by
+ modifying the message digest file, not the Java code. Also, message strings in the
+ digest can contain parameters that are filled in when the exception is thrown. The
+ format of the message digest file is described in the Javadocs for the Java class
+ <literal>java.util.PropertyResourceBundle</literal> and in the load method of
+ <literal>java.util.Properties</literal>.</para>
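+
+ <para>The lookup-and-fill-in mechanism can be illustrated with the standard Java classes alone. This is
+ a sketch of the general technique, not UIMA's own implementation: the
+ <literal>MessageDigestSketch</literal> class and the <literal>invalid_regex</literal> key are
+ hypothetical, and the <literal>{0}</literal> placeholder syntax assumes that parameters are substituted
+ in the style of <literal>java.text.MessageFormat</literal>.</para>
+
+ <programlisting>import java.io.IOException;
+import java.io.StringReader;
+import java.io.UncheckedIOException;
+import java.text.MessageFormat;
+import java.util.PropertyResourceBundle;
+import java.util.ResourceBundle;
+
+public class MessageDigestSketch {
+
+  /** Looks up a key in digest text (properties syntax) and fills in the parameters. */
+  public static String format(String digestText, String key, Object... args) {
+    try {
+      // In practice the digest would be a .properties file on the classpath.
+      ResourceBundle bundle = new PropertyResourceBundle(new StringReader(digestText));
+      return MessageFormat.format(bundle.getString(key), args);
+    } catch (IOException e) {
+      throw new UncheckedIOException(e);
+    }
+  }
+}</programlisting>
+
+ <para>For example, given a digest containing the line
+ <literal>invalid_regex = The regular expression {0} is not valid.</literal>, the call
+ <literal>format(digest, "invalid_regex", "[a-z")</literal> returns
+ <literal>The regular expression [a-z is not valid.</literal> Translating the message then only
+ requires a different digest file; the Java code and the key stay the same.</para>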
+
+ <para>The first thing an annotator developer must choose is what Exception class to
+ use. There are three to choose from:
+
+ <orderedlist><listitem><para>ResourceConfigurationException should be
+ thrown from the annotator's reconfigure() method if invalid configuration
+ parameter values have been specified.
+ </para></listitem>
+
+ <listitem><para>ResourceInitializationException should be thrown from the
+ annotator's initialize() method if initialization fails for any
+ reason (including invalid configuration parameters).</para></listitem>
+
+ <listitem><para>AnalysisEngineProcessException should be thrown from the
+ annotator's process() method if the processing of a particular document
+ fails for any reason. </para></listitem></orderedlist></para>
+
+ <para>Generally you will not need to define your own custom exception classes, but if
+ you do they must extend one of these three classes, which are the only types of
+ Exceptions that the annotator interface permits annotators to throw.</para>
+
+ <para>All of the UIMA Exception classes share common constructor varieties. There are
+ four possible arguments:</para>
+
+ <orderedlist>
+ <listitem><para>The name of the message digest to use (optional – if not specified the
+ default UIMA message digest is used).</para></listitem>
+
+ <listitem><para>The key string used to select the message in the message digest.</para></listitem>
+
+ <listitem><para>An object array containing the parameters to include in the message. Messages
+ can have substitutable parts. When the message is given, the string representations
+ of the objects passed are substituted into the message. The object array is often
+ created using the syntax <literal>new Object[]{x, y}</literal>.</para></listitem>
+
+ <listitem><para>Another exception that is the <quote>cause</quote> of the exception you are
+ throwing (optional). This feature is commonly used when you catch another exception and rethrow
+ it.</para></listitem>
+ </orderedlist>
+
[... 1038 lines stripped ...]