Posted to commits@uima.apache.org by sc...@apache.org on 2008/08/28 23:28:16 UTC
svn commit: r689997 [2/32] - in /incubator/uima/uimaj/trunk/uima-docbooks:
./ src/ src/docbook/overview_and_setup/ src/docbook/references/
src/docbook/tools/ src/docbook/tutorials_and_users_guides/
src/docbook/uima/organization/ src/olink/references/
Modified: incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/overview_and_setup/conceptual_overview.xml
URL: http://svn.apache.org/viewvc/incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/overview_and_setup/conceptual_overview.xml?rev=689997&r1=689996&r2=689997&view=diff
==============================================================================
--- incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/overview_and_setup/conceptual_overview.xml (original)
+++ incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/overview_and_setup/conceptual_overview.xml Thu Aug 28 14:28:14 2008
@@ -1,978 +1,978 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
-"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
-<!ENTITY key_concepts "Key UIMA Concepts Introduced in this Section:">
-<!ENTITY imgroot "../images/overview_and_setup/conceptual_overview_files/" >
-<!ENTITY % uimaents SYSTEM "../entities.ent" >
-%uimaents;
-]>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<chapter id="ugr.ovv.conceptual">
- <title>UIMA Conceptual Overview</title>
-
- <para>UIMA is an open, industrial-strength, scalable and extensible platform for
- creating, integrating and deploying unstructured information management solutions
- from powerful text or multi-modal analysis and search components. </para>
-
- <para>The Apache UIMA project is a Java implementation of the UIMA framework,
- available under the Apache License. It provides a common foundation for industry and
- academia to collaborate on and accelerate the world-wide development of technologies
- critical for discovering vital knowledge present in the fastest growing sources of
- information today.</para>
-
- <para>This chapter presents an introduction to many essential UIMA concepts. It is meant to
- provide a broad overview to give the reader a quick sense of UIMA's basic
- architectural philosophy and the UIMA SDK's capabilities. </para>
-
- <para>This chapter provides a general orientation to UIMA and makes liberal reference to
- the other chapters in the UIMA SDK documentation set, where the reader may find detailed
- treatments of key concepts and development practices. It may be useful to refer to <olink
- targetdoc="&uima_docs_overview;" targetptr="ugr.glossary"/>, to become familiar
- with the terminology in this overview.</para>
-
- <section id="ugr.ovv.conceptual.uima_introduction">
- <title>UIMA Introduction</title>
- <figure id="ugr.ovv.conceptual.fig.bridge">
- <title>UIMA helps you build the bridge between the unstructured and structured
- worlds</title>
- <mediaobject>
- <imageobject>
- <imagedata width="5.5in" format="PNG" fileref="&imgroot;image002.png"/>
- </imageobject>
- <textobject><phrase>Picture of a bridge between unstructured information
- artifacts and structured metadata about those artifacts</phrase>
- </textobject>
- </mediaobject>
- </figure>
-
- <para> Unstructured information represents the largest, most current and fastest
- growing source of information available to businesses and governments. The web is just
- the tip of the iceberg. Consider the mounds of information hosted in the enterprise and
- around the world and across different media including text, voice and video. The
- high-value content in these vast collections of unstructured information is,
- unfortunately, buried in lots of noise. Searching for what you need or doing
- sophisticated data mining over unstructured information sources presents new
- challenges. </para>
-
- <para>An unstructured information management (UIM) application may be generally
- characterized as a software system that analyzes large volumes of unstructured
- information (text, audio, video, images, etc.) to discover, organize and deliver
- relevant knowledge to the client or application end-user. An example is an application
- that processes millions of medical abstracts to discover critical drug interactions.
- Another example is an application that processes tens of millions of documents to
- discover key evidence indicating probable competitive threats. </para>
-
- <para>First and foremost, the unstructured data must be analyzed to interpret, detect
- and locate concepts of interest, for example, named entities like persons,
- organizations, locations, facilities, products etc., that are not explicitly tagged
- or annotated in the original artifact. More challenging analytics may detect things
- like opinions, complaints, threats or facts. And then there are relations, for
- example, located in, finances, supports, purchases, repairs etc. The list of concepts
- important for applications to discover in unstructured content is large, varied and
- often domain specific.
- Many different component analytics may solve different parts of the overall analysis task.
- These component analytics must interoperate and must be easily combined to facilitate
- the development of UIM applications.</para>
-
- <para>The results of analysis are used to populate structured forms so that conventional
- data processing and search technologies
- like search engines, database engines or OLAP
- (On-Line Analytical Processing, or Data Mining) engines
- can efficiently deliver the newly discovered content in response to client requests
- or queries.</para>
-
- <para>In analyzing unstructured content, UIM applications make use of a variety of
- analysis technologies including:</para>
-
- <itemizedlist spacing="compact">
- <listitem><para>Statistical and rule-based Natural Language Processing
- (NLP)</para>
- </listitem>
- <listitem><para>Information Retrieval (IR)</para>
- </listitem>
- <listitem><para>Machine learning</para>
- </listitem>
- <listitem><para>Ontologies</para>
- </listitem>
- <listitem><para>Automated reasoning, and</para>
- </listitem>
- <listitem><para>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</para>
- </listitem>
-
- </itemizedlist>
-
- <para>Specific analysis capabilities using these technologies are developed
- independently using different techniques, interfaces and platforms.
- </para>
-
- <para>The bridge from the unstructured world to the structured world is built through the
- composition and deployment of these analysis capabilities. This integration is often
- a costly challenge. </para>
-
- <para>The Unstructured Information Management Architecture (UIMA) is an architecture
- and software framework that helps you build that bridge. It supports creating,
- discovering, composing and deploying a broad range of analysis capabilities and
- linking them to structured information services.</para>
-
- <para>UIMA allows development teams to match the right skills with the right parts of a
- solution and helps enable rapid integration across technologies and platforms using a
- variety of different deployment options. These range from tightly-coupled
- deployments for high-performance, single-machine, embedded solutions to parallel
- and fully distributed deployments for highly flexible and scalable
- solutions.</para>
-
- </section>
-
- <section id="ugr.ovv.conceptual.architecture_framework_sdk">
- <title>The Architecture, the Framework and the SDK</title>
- <para>UIMA is a software architecture which specifies component interfaces, data
- representations, design patterns and development roles for creating, describing,
- discovering, composing and deploying multi-modal analysis capabilities.</para>
-
- <para>The <emphasis role="bold">UIMA framework</emphasis> provides a run-time
- environment in which developers can plug in their UIMA component implementations and
- with which they can build and deploy UIM applications. The framework is not specific to
- any IDE or platform. Apache hosts a Java and (soon) a C++ implementation of the UIMA
- Framework.</para>
-
- <para>The <emphasis role="bold">UIMA Software Development Kit (SDK)</emphasis>
- includes the UIMA framework, plus tools and utilities for using UIMA. Some of the
- tooling supports an Eclipse-based ( <ulink url="http://www.eclipse.org/"/>)
- development environment. </para>
-
- </section>
-
- <section id="ugr.ovv.conceptual.analysis_basics">
- <title>Analysis Basics</title>
- <note><title>&key_concepts;</title><para>Analysis Engine, Document, Annotator, Annotator
- Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA
- Context.</para>
- </note>
-
- <section id="ugr.ovv.conceptual.aes_annotators_and_analysis_results">
- <title>Analysis Engines, Annotators & Results</title>
- <figure id="ugr.ovv.conceptual.metadata_in_cas">
- <title>Objects represented in the Common Analysis Structure (CAS)</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata format="PNG" width="594px" align="center" fileref="&imgroot;image004.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata format="PNG" width="5.5in" align="center" fileref="&imgroot;image004.png"/>
- </imageobject>
- <textobject><phrase>Picture of some text, with a hierarchy of discovered
- metadata about words in the text, including some image of a person as metadata
- about that name.</phrase>
- </textobject>
- </mediaobject>
- </figure>
-
- <para>UIMA is an architecture in which basic building blocks called Analysis Engines
- (AEs) are composed to analyze a document and infer and record descriptive attributes
- about the document as a whole, and/or about regions therein. This descriptive
- information, produced by AEs is referred to generally as <emphasis role="bold">
- analysis results</emphasis>. Analysis results typically represent meta-data
- about the document content. One way to think about AEs is as software agents that
- automatically discover and record meta-data about original content.</para>
-
- <para>UIMA supports the analysis of different modalities including text, audio and
- video. The majority of examples we provide are for text. We use the term <emphasis
- role="bold">document</emphasis>, therefore, to generally refer to any unit of
- content that an AE may process, whether it is a text document or a segment of audio, for
- example. See the section <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.mvs"/> for more information on multimodal processing
- in UIMA.</para>
-
- <para>Analysis results include different statements about the content of a document.
- For example, the following is an assertion about the topic of a document:</para>
-
-
- <programlisting>(1) The Topic of document D102 is "CEOs and Golf".</programlisting>
-
- <para>Analysis results may include statements describing regions more granular than
- the entire document. We use the term <emphasis role="bold">span</emphasis> to
- refer to a sequence of characters in a text document. Consider that a document with the
- identifier D102 contains a span, <quote>Fred Centers</quote> starting at
- character position 101. An AE that can detect persons in text may represent the
- following statement as an analysis result:</para>
-
-
- <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
-
- <para>In both statements 1 and 2 above there is a special pre-defined term, or what we
- call in UIMA a <emphasis role="bold">Type</emphasis>. They are
- <emphasis>Topic</emphasis> and <emphasis>Person</emphasis> respectively.
- UIMA types characterize the kinds of results that an AE may create – more on
- types later.</para>
-
- <para>Other analysis results may relate two statements. For example, an AE might
- record in its results that two spans are both referring to the same person:</para>
-
-
- <programlisting>(3) The Person denoted by span 101 to 112 and
- the Person denoted by span 141 to 143 in document D102
- refer to the same Entity.</programlisting>
-
- <para>The above statements are some examples of the kinds of results that AEs may record
- to describe the content of the documents they analyze. These are not meant to indicate
- the form or syntax with which these results are captured in UIMA – more on that
- later in this overview.</para>
-
- <para>The UIMA framework treats Analysis Engines as pluggable, composable,
- discoverable, managed objects. At the heart of AEs are the analysis algorithms that
- do all the work to analyze documents and record analysis results. </para>
-
- <para>UIMA provides a basic component type intended to house the core analysis
- algorithms running inside AEs. Instances of this component are called <emphasis
- role="bold">Annotators</emphasis>. The analysis algorithm developer's
- primary concern therefore is the development of annotators. The UIMA framework
- provides the necessary methods for taking annotators and creating analysis
- engines.</para>
-
- <para>In UIMA the person who codes analysis algorithms takes on the role of the
- <emphasis role="bold">Annotator Developer</emphasis>. <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.aae"/> will take the reader
- through the details involved in creating UIMA annotators and analysis
- engines.</para>
-
- <para>At the most primitive level an AE wraps an annotator adding the necessary APIs and
- infrastructure for the composition and deployment of annotators within the UIMA
- framework. The simplest AE contains exactly one annotator at its core. Complex AEs
- may contain a collection of other AEs each potentially containing within them other
- AEs. </para>
- </section>
-
- <section id="ugr.ovv.conceptual.representing_results_in_cas">
- <title>Representing Analysis Results in the CAS</title>
-
- <para>How annotators represent and share their results is an important part of the UIMA
- architecture. UIMA defines a <emphasis role="bold">Common Analysis Structure
- (CAS)</emphasis> precisely for these purposes.</para>
-
- <para>The CAS is an object-based data structure that allows the representation of
- objects, properties and values. Object types may be related to each other in a
- single-inheritance hierarchy. The CAS logically (if not physically) contains the
- document being analyzed. Analysis developers share and record their analysis
- results in terms of an object model within the CAS. <footnote><para> We have plans to
- extend the representational capabilities of the CAS and align its semantics with the
- semantics of the OMG's Essential Meta-Object Facility (EMOF) and with the
- semantics of the Eclipse Modeling Framework's ( <ulink
- url="http://www.eclipse.org/emf/"/>) Ecore semantics and XMI-based
- representation.</para> </footnote> </para>
-
- <para>The UIMA framework includes an implementation and interfaces to the CAS. For a
- more detailed description of the CAS and its interfaces see <olink
- targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>
-
- <para>A CAS that logically contains statement 2 (repeated here for your
- convenience)</para>
-
-
- <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
-
- <para>would include objects of the Person type. For each person found in the body of a
- document, the AE would create a Person object in the CAS and link it to the span of text
- where the person was mentioned in the document.</para>
-
- <para>While the CAS is a general purpose data structure, UIMA defines a
- few basic types and affords the developer the ability to extend these to define an
- arbitrarily rich <emphasis role="bold">Type System</emphasis>. You can think of a
- type system as an object schema for the CAS.</para>
-
- <para>A type system defines the various types of objects that may be discovered in
- documents by AEs that subscribe to that type system.</para>
-
- <para>As suggested above, Person may be defined as a type. Types have properties or
- <emphasis role="bold">features</emphasis>. So for example,
- <emphasis>Age</emphasis> and <emphasis>Occupation</emphasis> may be defined as
- features of the Person type.</para>
-
- <para>Other types might be <emphasis>Organization, Company, Bank, Facility, Money,
- Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun
- Phrase, Verb, Color, Parse Node, Feature Weight Array</emphasis> etc.</para>
-
- <para>There are no limits to the different types that may be defined in a type system. A
- type system is domain and application specific.</para>
-
- <para>Types in a UIMA type system may be organized into a taxonomy. For example,
- <emphasis>Company</emphasis> may be defined as a subtype of
- <emphasis>Organization</emphasis>. <emphasis>NounPhrase</emphasis> may be a
- subtype of a <emphasis>ParseNode</emphasis>.</para>
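The single-inheritance taxonomy described above can be sketched in plain Java. The type names (Organization, Company, ParseNode, NounPhrase) come from the text, but the classes themselves are only illustrative; a real UIMA type system is declared in a descriptor, not as Java classes.

```java
// Illustrative single-inheritance taxonomy mirroring a UIMA type system.
// Plain Java classes for exposition only -- not the UIMA SDK API.
class Organization { }
class Company extends Organization { }   // Company is a subtype of Organization

class ParseNode { }
class NounPhrase extends ParseNode { }   // NounPhrase is a subtype of ParseNode

public class TaxonomySketch {
    // A Company instance is also an Organization, as in the type hierarchy.
    public static boolean isOrganization(Object o) {
        return o instanceof Organization;
    }

    public static void main(String[] args) {
        System.out.println(isOrganization(new Company()));   // true
        System.out.println(isOrganization(new NounPhrase())); // false
    }
}
```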
-
- <section id="ugr.ovv.conceptual.annotation_type">
- <title>The Annotation Type</title>
-
- <para>A general and common type used in artifact analysis and from which additional
- types are often derived is the <emphasis role="bold">annotation</emphasis>
- type. </para>
-
- <para>The annotation type is used to annotate or label regions of an artifact. Common
- artifacts are text documents, but they can be other things, such as audio streams.
- The annotation type for text includes two features, namely
- <emphasis>begin</emphasis> and <emphasis>end</emphasis>. Values of these
- features represent integer offsets in the artifact and delimit a span. Any
- particular annotation object identifies the span it annotates with the
- <emphasis>begin</emphasis> and <emphasis>end</emphasis> features.</para>
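The begin/end idea can be sketched as a minimal, self-contained Java class; this models the concept of an annotation delimiting a span (the document text and offsets below are invented for illustration and are not the UIMA Annotation API).

```java
// Minimal illustration of the annotation idea: begin and end offsets that
// delimit a span of a text artifact. Models the concept; not the UIMA API.
public class AnnotationSketch {
    final int begin;   // offset of the first character of the span
    final int end;     // offset just past the last character of the span

    AnnotationSketch(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Recover the annotated region from the document text.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    public static void main(String[] args) {
        String doc = "A meeting with Fred Centers was held.";
        AnnotationSketch person = new AnnotationSketch(15, 27);
        System.out.println(person.coveredText(doc));  // Fred Centers
    }
}
```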
-
- <para>The key idea here is that the annotation type is used to identify and label or
- <quote>annotate</quote> a specific region of an artifact.</para>
-
- <para>Consider that the Person type is defined as a subtype of annotation. An
- annotator, for example, can create a Person annotation to record the discovery of a
- mention of a person between position 141 and 143 in document D102. The annotator can
- create another person annotation to record the detection of a mention of a person in
- the span between positions 101 and 112. </para>
- </section>
- <section id="ugr.ovv.conceptual.not_just_annotations">
- <title>Not Just Annotations</title>
-
- <para>While the annotation type is a useful type for annotating regions of a
- document, annotations are not the only kind of types in a CAS. A CAS is a general
- representation scheme and may store arbitrary data structures to represent the
- analysis of documents.</para>
-
- <para>As an example, consider statement 3 above (repeated here for your
- convenience).</para>
-
-
- <programlisting>(3) The Person denoted by span 101 to 112 and
- the Person denoted by span 141 to 143 in document D102
- refer to the same Entity.</programlisting>
-
- <para>This statement mentions two person annotations in the CAS; the first, call it
- P1, delimiting the span from 101 to 112, and the other, call it P2, delimiting the span
- from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the
- same entity. This means that while there are two expressions in the text
- represented by the annotations P1 and P2, each refers to one and the same person.
- </para>
-
- <para>The Entity type may be introduced into a type system to capture this kind of
- information. The Entity type is not an annotation. It is intended to represent an
- object in the domain which may be referred to by different expressions (or
- mentions) occurring multiple times within a document (or across documents within
- a collection of documents). The Entity type has a feature named
- <emphasis>occurrences</emphasis>. This feature is used to point to all the
- annotations believed to label mentions of the same entity.</para>
-
- <para>Consider that the spans annotated by P1 and P2 were <quote>Fred
- Center</quote> and <quote>He</quote> respectively. The annotator might create
- a new Entity object called
- <code>FredCenter</code>. To represent the relationship in statement 3 above,
- the annotator may link FredCenter to both P1 and P2 by making them values of its
- <emphasis>occurrences</emphasis> feature.</para>
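The Entity-with-occurrences idea can be sketched as follows. This is a plain-Java model of statement (3), assuming spans are represented as begin/end pairs; in UIMA the Entity and its occurrences feature would be CAS feature structures, not these classes.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of statement (3): an Entity (not an annotation) whose
// "occurrences" feature points at the annotations believed to mention it.
// Illustrative only; not the UIMA CAS representation.
public class EntitySketch {
    final String name;
    final List<int[]> occurrences = new ArrayList<>(); // spans as {begin, end}

    EntitySketch(String name) { this.name = name; }

    void addOccurrence(int begin, int end) {
        occurrences.add(new int[] { begin, end });
    }

    public static void main(String[] args) {
        // Two mentions ("Fred Center" and "He") of one and the same person.
        EntitySketch fredCenter = new EntitySketch("FredCenter");
        fredCenter.addOccurrence(101, 112); // P1
        fredCenter.addOccurrence(141, 143); // P2
        System.out.println(fredCenter.occurrences.size()); // 2
    }
}
```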
-
- <para> <xref linkend="ugr.ovv.conceptual.metadata_in_cas"/> also
- illustrates that an entity may be linked to annotations referring to regions of
- image documents as well. To do this the annotation type would have to be extended
- with the appropriate features to point to regions of an image.</para>
- </section>
-
- <section id="ugr.ovv.conceptual.multiple_views_within_a_cas">
- <title>Multiple Views within a CAS</title>
-
- <para>UIMA supports the simultaneous analysis of multiple views of a document. This
- support comes in handy for processing multiple forms of the artifact, for example, the audio
- and the closed captioned views of a single speech stream, or the tagged and detagged
- views of an HTML document.</para>
-
- <para>AEs analyze one or more views of a document. Each view contains a specific
- <emphasis role="bold">subject of analysis (Sofa)</emphasis>, plus a set of
- indexes holding metadata indexed by that view. The CAS, overall, holds one or more
- CAS Views, plus the descriptive objects that represent the analysis results for
- each. </para>
-
- <para>Another common example of using CAS Views is for different translations of a
- document. Each translation may be represented with a different CAS View. Each
- translation may be described by a different set of analysis results. For more
- details on CAS Views and Sofas see <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.mvs"/> and <olink
- targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>. </para>
- </section>
- </section>
-
- <section id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources">
- <title>Interacting with the CAS and External Resources</title>
- <titleabbrev>Using CASes and External Resources</titleabbrev>
-
- <para>The two main interfaces that a UIMA component developer interacts with are the
- CAS and the UIMA Context.</para>
-
- <para>UIMA provides an efficient implementation of the CAS with multiple programming
- interfaces. Through these interfaces, the annotator developer interacts with the
- document and reads and writes analysis results. The CAS interfaces provide a suite of
- access methods that allow the developer to obtain indexed iterators to the different
- objects in the CAS. See <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.cas"/>. While many objects may exist in a CAS, the annotator
- developer can obtain a specialized iterator to all Person objects associated with a
- particular view, for example.</para>
-
- <para>For Java annotator developers, UIMA provides the JCas. This interface provides
- the Java developer with a natural interface to CAS objects. Each type declared in the
- type system appears as a Java Class; the UIMA framework renders the Person type as a
- Person class in Java. As the analysis algorithm detects mentions of persons in the
- documents, it can create Person objects in the CAS. For more details on how to interact
- with the CAS using this interface, refer to <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.jcas"/>.</para>
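The idea of obtaining a specialized iterator over just the Person objects can be sketched in plain Java. This is only an analogy for the concept: the nested Person and Token classes and the filtering helper are invented for illustration, not the JCas or AnnotationIndex API.

```java
import java.util.Iterator;
import java.util.List;

// Sketch of the idea behind typed CAS iterators: from a heterogeneous set of
// analysis results, obtain an iterator over only the Person objects.
// A plain-Java analogy, not the JCas/AnnotationIndex API.
public class TypedIteratorSketch {
    static class Person { final String name; Person(String n) { name = n; } }
    static class Token  { }

    // Analogous to asking the CAS for an iterator specialized to one type.
    static Iterator<Person> personIterator(List<Object> results) {
        return results.stream()
                      .filter(r -> r instanceof Person)
                      .map(r -> (Person) r)
                      .iterator();
    }

    public static void main(String[] args) {
        List<Object> cas = List.of(new Token(), new Person("Fred Centers"), new Token());
        Iterator<Person> it = personIterator(cas);
        while (it.hasNext()) {
            System.out.println(it.next().name); // Fred Centers
        }
    }
}
```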
-
- <para>The component developer, in addition to interacting with the CAS, can access
- external resources through the framework's resource manager interface
- called the <emphasis role="bold">UIMA Context</emphasis>. This interface, among
- other things, can ensure that different annotators working together in an aggregate
- flow may share the same instance of an external file, for example. For details on using
- the UIMA Context see <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.aae"/>.</para>
-
- </section>
- <section id="ugr.ovv.conceptual.component_descriptors">
- <title>Component Descriptors</title>
- <para>UIMA defines interfaces for a small set of core components that users of the
- framework provide implementations for. Annotators and Analysis Engines are two of
- the basic building blocks specified by the architecture. Developers implement them
- to build and compose analysis capabilities and ultimately applications.</para>
-
- <para>There are other components in addition to these, which we will learn about
- later, but for every component specified in UIMA there are two parts required for its
- implementation:</para>
-
- <orderedlist spacing="compact">
- <listitem><para>the declarative part and</para></listitem>
- <listitem><para>the code part.</para></listitem>
- </orderedlist>
-
- <para>The declarative part contains metadata describing the component, its
- identity, structure and behavior and is called the <emphasis role="bold">
- Component Descriptor</emphasis>. Component descriptors are represented in XML.
- The code part implements the algorithm. The code part may be a program in Java.</para>
-
- <para>As a developer using the UIMA SDK, to implement a UIMA component you always
- provide two things: the code part and the Component Descriptor.
- Note that when you are composing an engine, the code may be already provided in
- reusable subcomponents. In these cases you may not be developing new code but rather
- composing an aggregate engine by pointing to other components where the code has been
- included.</para>
-
- <para>Component descriptors are represented in XML and aid in component discovery,
- reuse, composition and development tooling. The UIMA SDK provides tools for easily
- creating and maintaining component descriptors, relieving the developer from
- editing XML directly. This tooling is described briefly in <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.aae"/>, and more
- thoroughly in <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>
- .</para>
-
- <para>Component descriptors contain standard metadata including the
- component's name, author, version, and a reference to the class that
- implements the component.</para>
-
- <para>In addition to these standard fields, a component descriptor identifies the
- type system the component uses, the types it requires in an input CAS, and the types it
- plans to produce in an output CAS.</para>
-
- <para>For example, an AE that detects person types may require as input a CAS that
- includes a tokenization and deep parse of the document. The descriptor refers to a
- type system to make the component's input requirements and output types
- explicit. In effect, the descriptor includes a declarative description of the
- component's behavior and can be used to aid in component discovery and
- composition based on desired results. UIMA analysis engines provide an interface
- for accessing the component metadata represented in their descriptors. For more
- details on the structure of UIMA component descriptors refer to <olink
- targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.component_descriptor"/>.</para>
-
- </section>
- </section>
- <section id="ugr.ovv.conceptual.aggregate_analysis_engines">
- <title>Aggregate Analysis Engines</title>
-
- <note><title>&key_concepts;</title><para>Aggregate Analysis Engine, Delegate Analysis Engine,
- Tightly and Loosely Coupled, Flow Specification, Analysis Engine Assembler</para>
- </note>
-
- <figure id="ugr.ovv.conceptual.sample_aggregate">
- <title>Sample Aggregate Analysis Engine</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="588px" format="PNG" fileref="&imgroot;image006.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" format="PNG" fileref="&imgroot;image006.png"/>
- </imageobject>
- <textobject><phrase>Picture of multiple parts (a language identifier,
- tokenizer, part of speech annotator, shallow parser, and named entity detector)
- strung together into a flow, and all of them wrapped as a single aggregate object,
- which produces as annotations the union of all the results of the individual
- annotator components ( tokens, parts of speech, names, organizations, places,
- persons, etc.)</phrase>
- </textobject>
- </mediaobject>
- </figure>
-
- <para>A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs,
- however, may be defined to contain other AEs organized in a workflow. These more complex
- analysis engines are called <emphasis role="bold">Aggregate Analysis
- Engines.</emphasis> </para>
-
- <para>Annotators tend to perform fairly granular functions, for example language
- detection, tokenization or part of speech detection.
- These functions typically address just part of an overall analysis task. A workflow
- of component engines may be orchestrated to perform more complex tasks.</para>
-
- <para>An AE that performs named entity detection, for example, may
- include a pipeline of annotators starting with language detection feeding
- tokenization, then part-of-speech detection, then deep grammatical parsing and then
- finally named-entity detection. Each step in the pipeline is required by the
- subsequent analysis. For example, the final named-entity annotator can only do its
- analysis if the previous deep grammatical parse was recorded in the CAS.</para>
-
- <para>Aggregate AEs are built to encapsulate potentially complex internal structure
- and insulate it from users of the AE. In our example, the aggregate analysis engine
- developer acquires the internal components, defines the necessary flow
- between them and publishes the resulting AE. Consider the simple example illustrated
- in <xref linkend="ugr.ovv.conceptual.sample_aggregate"/> where
- <quote>MyNamed-EntityDetector</quote> is composed of a linear flow of more
- primitive analysis engines.</para>
-
- <para>Users of this AE need not know how it is constructed internally but only need its name
- and its published input requirements and output types. These must be declared in the
- aggregate AE's descriptor. Aggregate AE descriptors declare the components
- they contain and a <emphasis role="bold">flow specification</emphasis>. The flow
- specification defines the order in which the internal component AEs should be run. The
- internal AEs specified in an aggregate are also called the <emphasis role="bold">
- delegate analysis engines.</emphasis> The term "delegate" is used because aggregate AEs
- are thought to "delegate" functions to their internal AEs.</para>
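The delegate flow can be sketched with a tiny plain-Java pipeline, modeling the CAS as a shared map that each delegate enriches in the declared order. The delegate names and map keys below are invented for illustration; in UIMA the framework itself manages real delegate AEs and CAS access.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of an aggregate AE's linear flow: each delegate "analysis engine"
// receives the shared CAS (modeled here as a map) in the declared order and
// records its results. Illustrative only; not the UIMA framework API.
public class LinearFlowSketch {
    interface Delegate { void process(Map<String, Object> cas); }

    static void runFlow(List<Delegate> flow, Map<String, Object> cas) {
        for (Delegate d : flow) {
            d.process(cas);   // each delegate sees the results of earlier ones
        }
    }

    public static void main(String[] args) {
        Map<String, Object> cas = new LinkedHashMap<>();
        cas.put("text", "Fred Centers visited Armonk.");

        List<Delegate> flow = new ArrayList<>();
        flow.add(c -> c.put("language", "en"));                                  // language id
        flow.add(c -> c.put("tokens", ((String) c.get("text")).split(" ").length)); // tokenizer
        flow.add(c -> c.put("namedEntities", List.of("Fred Centers", "Armonk"))); // NE detector

        runFlow(flow, cas);
        System.out.println(cas.keySet()); // [text, language, tokens, namedEntities]
    }
}
```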
-
- <para>
- In UIMA 2.0, the developer can implement a "Flow Controller" and include it as part
- of an aggregate AE by referring to it in the aggregate AE's descriptor.
- The flow controller is responsible for computing the "flow", that is,
- for determining the order in which the delegate AEs will process the CAS.
- The Flow Controller has access to the CAS and any external resources it may require
- for determining the flow. It can do this dynamically at run time, it can
- make multi-step decisions and it can consider any sort of flow specification
- included in the aggregate AE's descriptor. See
- <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.fc"/>
- for details on the UIMA Flow Controller interface.
- </para>
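The idea can be sketched with a toy model in plain Java. The class and delegate names here are invented, and this is not the real org.apache.uima.flow interface; it only illustrates a flow computed dynamically from what the CAS already contains.

```java
import java.util.*;

// Toy model of a flow controller (plain Java; the real interfaces
// live in org.apache.uima.flow and are richer than this sketch).
public class FlowDemo {

    // Stand-in for a CAS: just the set of type names already recorded in it.
    static class ToyCas {
        final Set<String> typesPresent = new HashSet<>();
    }

    // Decide the delegate order dynamically, based on CAS contents.
    static List<String> computeFlow(ToyCas cas) {
        List<String> flow = new ArrayList<>(List.of("Tokenizer", "NameDetector"));
        // Data-driven, multi-step decision: run coreference only if an
        // earlier deep parse was recorded in the CAS.
        if (cas.typesPresent.contains("DeepParse")) {
            flow.add("CoreferenceResolver");
        }
        return flow;
    }

    public static void main(String[] args) {
        ToyCas cas = new ToyCas();
        System.out.println(computeFlow(cas));     // no deep parse recorded yet
        cas.typesPresent.add("DeepParse");
        System.out.println(computeFlow(cas));     // now includes coreference
    }
}
```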
-
-    <para>We refer to the development role associated with building an aggregate from
-      delegate AEs as the <emphasis role="bold">Analysis Engine
-      Assembler</emphasis>.</para>
-
- <para>The UIMA framework, given an aggregate analysis engine descriptor, will run all
- delegate AEs, ensuring that each one gets access to the CAS in the sequence produced by
- the flow controller. The UIMA framework is equipped to handle different
- deployments where the delegate engines, for example, are <emphasis role="bold">
- tightly-coupled</emphasis> (running in the same process) or <emphasis role="bold">
- loosely-coupled</emphasis> (running in separate processes or even on different
- machines). The framework supports a number of remote protocols for loose coupling
- deployments of aggregate analysis engines, including SOAP (which stands for Simple
- Object Access Protocol, a standard Web Services communications protocol).</para>
-
- <para>The UIMA framework facilitates the deployment of AEs as remote services by using an
- adapter layer that automatically creates the necessary infrastructure in response to
- a declaration in the component's descriptor. For more details on creating
-      aggregate analysis engines refer to <olink targetdoc="&uima_docs_ref;"
-      targetptr="ugr.ref.xml.component_descriptor"/>. The Component Descriptor Editor tool
- assists in the specification of aggregate AEs from a repository of available engines.
- For more details on this tool refer to <olink targetdoc="&uima_docs_tools;"
- targetptr="ugr.tools.cde"/>.</para>
-
-    <para>The UIMA framework implementation has two built-in flow implementations: one
-      that supports a linear flow between components, and one with conditional branching
- based on the language of the document. It also supports user-provided flow
- controllers, as described in <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.fc"/>. Furthermore, the application developer is
- free to create multiple AEs and provide their own logic to combine the AEs in arbitrarily
- complex flows. For more details on this the reader may refer to <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application.using_aes"/>.</para>
-
- </section>
-
- <section id="ugr.ovv.conceptual.applicaiton_building_and_collection_processing">
- <title>Application Building and Collection Processing</title>
-
- <note><title>&key_concepts;</title><para>Process Method, Collection Processing Architecture,
- Collection Reader, CAS Consumer, CAS Initializer, Collection Processing Engine,
- Collection Processing Manager.</para></note>
-
- <section id="ugr.ovv.conceptual.using_framework_from_an_application">
- <title>Using the framework from an Application</title>
-
- <figure id="ugr.ovv.conceptual.application_factory_ae">
- <title>Using UIMA Framework to create and interact with an Analysis Engine</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="618px" align="center" format="PNG" fileref="&imgroot;image008.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image008.png"/>
- </imageobject>
- <textobject><phrase>Picture of application interacting with UIMA's
- factory to produce an analysis engine, which acts as a container for annotators,
- and interfaces with the application via the process and getMetaData methods
- among others.</phrase>
- </textobject>
- </mediaobject>
- </figure>
-
- <para>As mentioned above, the basic AE interface may be thought of as simply CAS in/CAS
- out.</para>
-
- <para>The application is responsible for interacting with the UIMA framework to
- instantiate an AE, create or acquire an input CAS, initialize the input CAS with a
- document and then pass it to the AE through the <emphasis role="bold">process
- method</emphasis>. This interaction with the framework is illustrated in <xref
- linkend="ugr.ovv.conceptual.application_factory_ae"/>. </para>
-
- <para>The UIMA AE Factory takes the declarative information from the Component
- Descriptor and the class files implementing the annotator, and instantiates the AE
- instance, setting up the CAS and the UIMA Context.</para>
-
- <para>The AE, possibly calling many delegate AEs internally, performs the overall
- analysis and its process method returns the CAS containing new analysis results.
- </para>
-
- <para>The application then decides what to do with the returned CAS. There are many
- possibilities. For instance the application could: display the results, store the
- CAS to disk for post processing, extract and index analysis results as part of a search
- or database application etc.</para>
-
- <para>The UIMA framework provides methods to support the application developer in
- creating and managing CASes and instantiating, running and managing AEs. Details
- may be found in <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application"/>.</para>
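The CAS-in/CAS-out contract can be illustrated with a self-contained toy model in plain Java. These classes are simplified stand-ins, invented for illustration; they are not the actual UIMA AnalysisEngine and CAS APIs.

```java
import java.util.*;

// Toy model of the "CAS in / CAS out" process method contract.
public class ProcessDemo {

    static class Annotation {
        final String type; final int begin, end;
        Annotation(String type, int begin, int end) {
            this.type = type; this.begin = begin; this.end = end;
        }
    }

    // Stand-in for a CAS: the artifact plus the analysis results about it.
    static class ToyCas {
        String documentText;
        final List<Annotation> annotations = new ArrayList<>();
    }

    interface ToyAnalysisEngine {
        void process(ToyCas cas);   // reads the CAS, adds analysis results
    }

    // A trivial "annotator": marks each whitespace-separated token.
    static final ToyAnalysisEngine TOKENIZER = cas -> {
        int pos = 0;
        for (String tok : cas.documentText.split("\\s+")) {
            int begin = cas.documentText.indexOf(tok, pos);
            cas.annotations.add(new Annotation("Token", begin, begin + tok.length()));
            pos = begin + tok.length();
        }
    };

    public static void main(String[] args) {
        ToyCas cas = new ToyCas();                 // the application acquires a CAS,
        cas.documentText = "Fred Centers spoke";   // initializes it with a document,
        TOKENIZER.process(cas);                    // and passes it to the process method;
        System.out.println(cas.annotations.size() + " tokens");  // then uses the results
    }
}
```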
- </section>
-
- <section id="ugr.ovv.conceptual.graduating_to_collection_processing">
- <title>Graduating to Collection Processing</title>
- <figure id="ugr.ovv.conceptual.fig.cpe">
- <title>High-Level UIMA Component Architecture from Source to Sink</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="578px" format="PNG" align="center" fileref="&imgroot;image010.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image010.png"/>
- </imageobject>
- </mediaobject>
- </figure>
-
- <para>Many UIM applications analyze entire collections of documents. They connect to
- different document sources and do different things with the results. But in the
- typical case, the application must generally follow these logical steps:
-
- <orderedlist spacing="compact">
- <listitem><para>Connect to a physical source</para></listitem>
- <listitem><para>Acquire a document from the source</para></listitem>
- <listitem><para>Initialize a CAS with the document to be analyzed</para>
- </listitem>
- <listitem><para>Send the CAS to a selected analysis engine</para></listitem>
- <listitem><para>Process the resulting CAS</para></listitem>
- <listitem><para>Go back to 2 until the collection is processed</para>
- </listitem>
- <listitem><para>Do any final processing required after all the documents in the
- collection have been analyzed</para></listitem>
- </orderedlist> </para>
-
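The loop above can be sketched in plain Java. Everything here is a toy stand-in, invented for illustration, for the real Collection Reader, analysis engine and CAS Consumer components.

```java
import java.util.*;

// Toy sketch of the source-to-sink loop (steps 1 through 7 above).
public class CollectionLoop {

    public static List<String> processCollection(Iterator<String> reader) {
        List<String> consumed = new ArrayList<>();          // stands in for a CAS Consumer
        while (reader.hasNext()) {                          // steps 2 and 6: iterate the source
            String document = reader.next();                // step 2: acquire a document
            String cas = document;                          // step 3: initialize a "CAS"
            String analyzed = cas.toUpperCase(Locale.ROOT); // step 4: run the "analysis engine"
            consumed.add(analyzed);                         // step 5: process the resulting CAS
        }
        return consumed;                                    // step 7: final processing
    }

    public static void main(String[] args) {
        // step 1: "connect" to a source; here just an in-memory list
        List<String> source = List.of("doc one", "doc two");
        System.out.println(processCollection(source.iterator()));
    }
}
```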
- <para>UIMA supports UIM application development for this general type of processing
- through its <emphasis role="bold">Collection Processing
- Architecture</emphasis>.</para>
-
- <para>As part of the collection processing architecture UIMA introduces two primary
- components in addition to the annotator and analysis engine. These are the <emphasis
- role="bold">Collection Reader</emphasis> and the <emphasis role="bold">CAS
- Consumer</emphasis>. The complete flow from source, through document analysis,
- and to CAS Consumers supported by UIMA is illustrated in <xref
- linkend="ugr.ovv.conceptual.fig.cpe"/>.</para>
-
- <para>The Collection Reader's job is to connect to and iterate through a source
- collection, acquiring documents and initializing CASes for analysis. </para>
-
- <!--
- <para>Since the structure, access and iteration methods for
- physical document sources vary independently from the format of stored
- documents, UIMA defines another type of component called a <emphasis role="bold">CAS Intializer</emphasis>.
- The CAS Initializer's job is specific to a
- document format and specialized logic for mapping that format to a CAS. In the
- simplest case a CAS Intializer may take the document provided by the containing
- Collection Reader and insert it as a subject of analysis (or Sofa) in the
- CAS. A more advanced scenario is one
- where the CAS Intializer may be implemented to handle documents that conform to
- a certain XML schema and map some subset of the XML tags to CAS types and then
- insert the de-tagged document content as the subject of analysis. Collection Readers may reuse plug-in CAS
- Initializers for different document formats.</para>
- -->
-
- <para>CAS Consumers, as the name suggests, function at the end of the flow. Their job is
- to do the final CAS processing. A CAS Consumer may be implemented, for example, to
- index CAS contents in a search engine, extract elements of interest and populate a
- relational database or serialize and store analysis results to disk for subsequent
- and further analysis. </para>
-
- <para>A Semantic Search engine that works with UIMA is available from <ulink
- url="http://www.alphaworks.ibm.com/tech/uima">IBM's alphaWorks
-      site</ulink>, which allows the developer to experiment with indexing analysis
- results and querying for documents based on all the annotations in the CAS. See the
- section on integrating text analysis and search in <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application"/>.</para>
-
- <para>A UIMA <emphasis role="bold">Collection Processing Engine</emphasis> (CPE)
- is an aggregate component that specifies a <quote>source to sink</quote> flow from a
- Collection Reader though a set of analysis engines and then to a set of CAS Consumers.
- </para>
-
- <para>CPEs are specified by XML files called CPE Descriptors. These are declarative
- specifications that point to their contained components (Collection Readers,
- analysis engines and CAS Consumers) and indicate a flow among them. The flow
- specification allows for filtering capabilities to, for example, skip over AEs
- based on CAS contents. Details about the format of CPE Descriptors may be found in
- <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.
- </para>
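As an illustration only, a CPE descriptor has roughly the following shape. Many required elements are omitted and the referenced descriptor files are hypothetical; see the CPE Descriptor Reference for the complete, authoritative schema.

```xml
<!-- Hypothetical, abridged skeleton of a CPE descriptor -->
<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <collectionReader>
    <collectionIterator>
      <descriptor><import location="FileSystemCollectionReader.xml"/></descriptor>
    </collectionIterator>
  </collectionReader>
  <casProcessors>
    <!-- flows from the analysis engine to the CAS Consumer -->
    <casProcessor deployment="integrated" name="NamedEntityDetector">
      <descriptor><import location="MyNamedEntityDetector.xml"/></descriptor>
    </casProcessor>
    <casProcessor deployment="integrated" name="SearchIndexConsumer">
      <descriptor><import location="SearchIndexConsumer.xml"/></descriptor>
    </casProcessor>
  </casProcessors>
</cpeDescription>
```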
-
- <figure id="ugr.ovv.conceptual.fig.cpm">
- <title>Collection Processing Manager in UIMA Framework</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="576px" align="center" format="PNG" fileref="&imgroot;image012.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image012.png"/>
- </imageobject>
- <textobject><phrase>box and arrows picture of application using CPE factory to
- instantiate a Collection Processing Engine, and that engine interacting with
- the application.</phrase></textobject>
- </mediaobject>
- </figure>
-
- <para>The UIMA framework includes a <emphasis role="bold">Collection Processing
- Manager</emphasis> (CPM). The CPM is capable of reading a CPE descriptor, and
-      deploying and running the specified CPE. <xref
-        linkend="ugr.ovv.conceptual.fig.cpm"/> illustrates the role of the CPM
-      in the UIMA Framework.</para>
-
- <para>Key features of the CPM are failure recovery, CAS management and scale-out.
- </para>
-
- <para>Collections may be large and take considerable time to analyze. A configurable
- behavior of the CPM is to log faults on single document failures while continuing to
- process the collection. This behavior is commonly used because analysis components
- often tend to be the weakest link -- in practice they may choke on strangely formatted
- content. </para>
-
- <para>This deployment option requires that the CPM run in a separate process or a
- machine distinct from the CPE components. A CPE may be configured to run with a variety
- of deployment options that control the features provided by the CPM. For details see
-      <olink targetdoc="&uima_docs_ref;"
-      targetptr="ugr.ref.xml.cpe_descriptor"/>.</para>
-
- <para>The UIMA SDK also provides a tool called the CPE Configurator. This tool provides
- the developer with a user interface that simplifies the process of connecting up all
- the components in a CPE and running the result. For details on using the CPE
- Configurator see <olink targetdoc="&uima_docs_tools;"
- targetptr="ugr.tools.cpe"/>. This tool currently does not provide
- access to the full set of CPE deployment options supported by the CPM; however, you can
- configure other parts of the CPE descriptor by editing it directly. For details on how
- to create and run CPEs refer to <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.cpe"/>.</para>
-
- </section>
-
- </section>
-
- <section id="ugr.ovv.conceptual.exploiting_analysis_results">
- <title>Exploiting Analysis Results</title>
-
- <note><title>&key_concepts;</title><para>Semantic Search, XML Fragment Queries.</para>
- </note>
-
- <section id="ugr.ovv.conceptual.semantic_search">
- <title>Semantic Search</title>
-
- <para>In a simple UIMA Collection Processing Engine (CPE), a Collection Reader reads
- documents from the file system and initializes CASs with their content. These are
-        then fed to an AE that annotates tokens and sentences; the CASs, now enriched with
- token and sentence information, are passed to a CAS Consumer that populates a search
- engine index. </para>
-
- <para>The search engine query processor can then use the token index to provide basic
- key-word search. For example, given a query <quote>center</quote> the search
- engine would return all the documents that contained the word
- <quote>center</quote>.</para>
-
- <para><emphasis role="bold">Semantic Search</emphasis> is a search paradigm that
- can exploit the additional metadata generated by analytics like a UIMA CPE.</para>
-
- <para>Consider that we plugged a named-entity recognizer into the CPE described
-        above. Assume this analysis engine is capable of detecting mentions of persons and
-        organizations in documents and annotating them in the CAS.</para>
-
-      <para>Complementing the named-entity recognizer, we add a CAS Consumer that extracts,
-        in addition to token and sentence annotations, the person and organization
-        annotations added to the CASs by the named-entity detector. It then feeds these into
-        the semantic search engine's index.</para>
-
-      <para>The semantic search engine that comes with the UIMA SDK, for example, can exploit
-        this additional information from the CAS to support more powerful queries. For
-        example, imagine a user is looking for documents that mention an organization with
-        <quote>center</quote> in its name but is not sure of the full or precise name of the
-        organization. A key-word search on <quote>center</quote> would likely produce far
-        too many documents because <quote>center</quote> is a common and ambiguous term.
- The semantic search engine that is available from <ulink
- url="http://www.alphaworks.ibm.com/tech/uima"/> supports a query language
- called <emphasis role="bold">XML Fragments</emphasis>. This query language is
- designed to exploit the CAS annotations entered in its index. The XML Fragment query,
- for example,
-
-
- <programlisting><organization> center </organization></programlisting>
-        will produce only documents that contain <quote>center</quote> where it
-        appears as part of a mention annotated as an organization by the named-entity
- recognizer. This will likely be a much shorter list of documents more precisely
- matching the user's interest.</para>
-
- <para>Consider taking this one step further. We add a relationship recognizer that
- annotates mentions of the CEO-of relationship. We configure the CAS Consumer so that
- it sends these new relationship annotations to the semantic search index as well.
- With these additional analysis results in the index we can submit queries like
-
-
- <programlisting><ceo_of>
- <person> center </person>
- <organization> center </organization>
-</ceo_of></programlisting>
- This query will precisely target documents that contain a mention of an organization
- with <quote>center</quote> as part of its name where that organization is mentioned
- as part of a
- <code>CEO-of</code> relationship annotated by the relationship
- recognizer.</para>
-
- <para>For more details about using UIMA and Semantic Search see the section on
- integrating text analysis and search in <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application"/>.</para>
- </section>
-
- <section id="ugr.ovv.conceptual.databases">
- <title>Databases</title>
-
- <para>Search engine indices are not the only place to deposit analysis results for use
- by applications. Another classic example is populating databases. While many
-        approaches are possible with varying degrees of flexibility and performance, all are
-        highly dependent on application specifics. We included a simple sample CAS Consumer
-        that provides the basics for getting your analysis results into a relational
- database. It extracts annotations from a CAS and writes them to a relational
- database, using the open source Apache Derby database.</para>
- </section>
- </section>
-
- <section id="ugr.ovv.conceptual.multimodal_processing">
- <title>Multimodal Processing in UIMA</title>
- <para>In previous sections we've seen how the CAS is initialized with an initial
- artifact that will be subsequently analyzed by Analysis engines and CAS Consumers. The
- first Analysis engine may make some assertions about the artifact, for example, in the
- form of annotations. Subsequent Analysis engines will make further assertions about
- both the artifact and previous analysis results, and finally one or more CAS Consumers
- will extract information from these CASs for structured information storage.</para>
- <figure id="ugr.ovv.conceptual.fig.multiple_sofas">
- <title>Multiple Sofas in support of multi-modal analysis of an audio Stream. Some
- engines work on the audio <quote>view</quote>, some on the text
- <quote>view</quote> and some on both.</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="576px" format="PNG" align="center" fileref="&imgroot;image014.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image014.png"/>
- </imageobject>
- <textobject><phrase>Picture showing audio on the left broken into segments by a
- segmentation component, then sent to multiple analysis pipelines in parallel,
- some processing the raw audio, others processing the recognized speech as
- text.</phrase></textobject>
- </mediaobject>
- </figure>
- <para>Consider a processing pipeline, illustrated in <xref
- linkend="ugr.ovv.conceptual.fig.multiple_sofas"/>, that starts with an
- audio recording of a conversation, transcribes the audio into text, and then extracts
- information from the text transcript. Analysis Engines at the start of the pipeline are
- analyzing an audio subject of analysis, and later analysis engines are analyzing a text
- subject of analysis. The CAS Consumer will likely want to build a search index from
- concepts found in the text to the original audio segment covered by the concept.</para>
-
- <para>What becomes clear from this relatively simple scenario is that the CAS must be
-      capable of simultaneously holding multiple subjects of analysis. Some analysis
-      engines will analyze only one subject of analysis, some will analyze one and create
- another, and some will need to access multiple subjects of analysis at the same time.
- </para>
-
- <para>The support in UIMA for multiple subjects of analysis is called <emphasis
- role="bold">Sofa</emphasis> support; Sofa is an acronym which is derived from
- <emphasis role="underline">S</emphasis>ubject <emphasis role="underline">
- of</emphasis> <emphasis role="underline">A</emphasis>nalysis, which is a physical
- representation of an artifact (e.g., the detagged text of a web-page, the HTML
- text of the same web-page, the audio segment of a video, the close-caption text
- of the same audio segment). A Sofa may
- be associated with CAS Views. A particular CAS will have one or more views, each view
- corresponding to a particular subject of analysis, together with a set of the defined
- indexes that index the metadata created in that view.</para>
-
- <para>Analysis results can be indexed in, or <quote>belong</quote> to, a specific view.
- UIMA components may be written in <quote>Multi-View</quote> mode - able to create and
- access multiple Sofas at the same time, or in <quote>Single-View</quote> mode, simply
- receiving a particular view of the CAS corresponding to a particular single Sofa. For
- single-view mode components, it is up to the person assembling the component to supply
-      the needed information to ensure a particular view is passed to the component at run
- time. This is done using XML descriptors for Sofa mapping (see <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.mvs.sofa_name_mapping"/>).</para>
-
- <para>Multi-View capability brings benefits to text-only processing as well. An input
- document can be transformed from one format to another. Examples of this include
- transforming text from HTML to plain text or from one natural language to another.
- </para>
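A toy sketch in plain Java of the multiple-view idea follows. The view names and the trivial detagger are invented for illustration and do not use the real CAS view API.

```java
import java.util.*;

// Toy illustration of multiple views (Sofas) over one artifact:
// an HTML view plus a derived plain-text view in the same "CAS".
public class ViewDemo {

    public static Map<String, String> buildViews(String html) {
        Map<String, String> views = new LinkedHashMap<>();
        views.put("htmlView", html);
        // A "detagger" component reads one view and creates another;
        // downstream single-view components can then be mapped to "textView".
        views.put("textView", html.replaceAll("<[^>]*>", ""));
        return views;
    }

    public static void main(String[] args) {
        Map<String, String> cas = buildViews("<p>Fred <b>Centers</b></p>");
        System.out.println(cas.get("textView"));  // prints "Fred Centers"
    }
}
```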
- </section>
-
- <section id="ugr.ovv.conceptual.next_steps">
- <title>Next Steps</title>
-
- <para>This chapter presented a high-level overview of UIMA concepts. Along the way, it
- pointed to other documents in the UIMA SDK documentation set where the reader can find
- details on how to apply the related concepts in building applications with the UIMA
- SDK.</para>
-
- <para>At this point the reader may return to the documentation guide in <olink
- targetdoc="&uima_docs_overview;" targetptr="ugr.project_overview_doc_use"/>
- to learn how they might proceed in getting started using UIMA.</para>
-
- <para>For a more detailed overview of the UIMA architecture, framework and development
- roles we refer the reader to the following paper:</para>
-
- <para>D. Ferrucci and A. Lally, <quote>Building an example application using the
- Unstructured Information Management Architecture,</quote> <emphasis>IBM Systems
- Journal</emphasis> <emphasis role="bold">43</emphasis>, No. 3, 455-475 (2004).
- </para>
-
- <para>This paper can be found on line at <ulink
- url="http://www.research.ibm.com/journal/sj43-3.html"/></para>
- </section>
-
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
+"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
+<!ENTITY key_concepts "Key UIMA Concepts Introduced in this Section:">
+<!ENTITY imgroot "../images/overview_and_setup/conceptual_overview_files/" >
+<!ENTITY % uimaents SYSTEM "../entities.ent" >
+%uimaents;
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<chapter id="ugr.ovv.conceptual">
+ <title>UIMA Conceptual Overview</title>
+
+  <para>UIMA is an open, industrial-strength, scalable and extensible platform for
+ creating, integrating and deploying unstructured information management solutions
+ from powerful text or multi-modal analysis and search components. </para>
+
+ <para>The Apache UIMA project is an implementation of the Java UIMA framework available
+ under the Apache License, providing a common foundation for industry and academia to
+ collaborate and accelerate the world-wide development of technologies critical for
+ discovering vital knowledge present in the fastest growing sources of information
+ today.</para>
+
+ <para>This chapter presents an introduction to many essential UIMA concepts. It is meant to
+ provide a broad overview to give the reader a quick sense of UIMA's basic
+ architectural philosophy and the UIMA SDK's capabilities. </para>
+
+ <para>This chapter provides a general orientation to UIMA and makes liberal reference to
+ the other chapters in the UIMA SDK documentation set, where the reader may find detailed
+ treatments of key concepts and development practices. It may be useful to refer to <olink
+ targetdoc="&uima_docs_overview;" targetptr="ugr.glossary"/>, to become familiar
+ with the terminology in this overview.</para>
+
+ <section id="ugr.ovv.conceptual.uima_introduction">
+ <title>UIMA Introduction</title>
+ <figure id="ugr.ovv.conceptual.fig.bridge">
+ <title>UIMA helps you build the bridge between the unstructured and structured
+ worlds</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.5in" format="PNG" fileref="&imgroot;image002.png"/>
+ </imageobject>
+ <textobject><phrase>Picture of a bridge between unstructured information
+ artifacts and structured metadata about those artifacts</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+ <para> Unstructured information represents the largest, most current and fastest
+ growing source of information available to businesses and governments. The web is just
+ the tip of the iceberg. Consider the mounds of information hosted in the enterprise and
+ around the world and across different media including text, voice and video. The
+ high-value content in these vast collections of unstructured information is,
+ unfortunately, buried in lots of noise. Searching for what you need or doing
+ sophisticated data mining over unstructured information sources presents new
+ challenges. </para>
+
+ <para>An unstructured information management (UIM) application may be generally
+ characterized as a software system that analyzes large volumes of unstructured
+ information (text, audio, video, images, etc.) to discover, organize and deliver
+ relevant knowledge to the client or application end-user. An example is an application
+ that processes millions of medical abstracts to discover critical drug interactions.
+ Another example is an application that processes tens of millions of documents to
+ discover key evidence indicating probable competitive threats. </para>
+
+ <para>First and foremost, the unstructured data must be analyzed to interpret, detect
+ and locate concepts of interest, for example, named entities like persons,
+ organizations, locations, facilities, products etc., that are not explicitly tagged
+ or annotated in the original artifact. More challenging analytics may detect things
+ like opinions, complaints, threats or facts. And then there are relations, for
+ example, located in, finances, supports, purchases, repairs etc. The list of concepts
+ important for applications to discover in unstructured content is large, varied and
+ often domain specific.
+ Many different component analytics may solve different parts of the overall analysis task.
+ These component analytics must interoperate and must be easily combined to facilitate
+      the development of UIM applications.</para>
+
+    <para>The results of analysis are used to populate structured forms so that conventional
+ data processing and search technologies
+ like search engines, database engines or OLAP
+ (On-Line Analytical Processing, or Data Mining) engines
+      can efficiently deliver the newly discovered content in response to client
+      requests or queries.</para>
+
+ <para>In analyzing unstructured content, UIM applications make use of a variety of
+ analysis technologies including:</para>
+
+ <itemizedlist spacing="compact">
+ <listitem><para>Statistical and rule-based Natural Language Processing
+ (NLP)</para>
+ </listitem>
+ <listitem><para>Information Retrieval (IR)</para>
+ </listitem>
+ <listitem><para>Machine learning</para>
+ </listitem>
+ <listitem><para>Ontologies</para>
+ </listitem>
+ <listitem><para>Automated reasoning and</para>
+ </listitem>
+ <listitem><para>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>Specific analysis capabilities using these technologies are developed
+ independently using different techniques, interfaces and platforms.
+ </para>
+
+ <para>The bridge from the unstructured world to the structured world is built through the
+ composition and deployment of these analysis capabilities. This integration is often
+ a costly challenge. </para>
+
+ <para>The Unstructured Information Management Architecture (UIMA) is an architecture
+ and software framework that helps you build that bridge. It supports creating,
+ discovering, composing and deploying a broad range of analysis capabilities and
+ linking them to structured information services.</para>
+
+ <para>UIMA allows development teams to match the right skills with the right parts of a
+ solution and helps enable rapid integration across technologies and platforms using a
+      variety of different deployment options. These range from tightly-coupled
+      deployments for high-performance, single-machine, embedded solutions to parallel
+      and fully distributed deployments for highly flexible and scalable
+      solutions.</para>
+
+ </section>
+
+ <section id="ugr.ovv.conceptual.architecture_framework_sdk">
+ <title>The Architecture, the Framework and the SDK</title>
+ <para>UIMA is a software architecture which specifies component interfaces, data
+ representations, design patterns and development roles for creating, describing,
+ discovering, composing and deploying multi-modal analysis capabilities.</para>
+
+ <para>The <emphasis role="bold">UIMA framework</emphasis> provides a run-time
+ environment in which developers can plug in their UIMA component implementations and
+ with which they can build and deploy UIM applications. The framework is not specific to
+ any IDE or platform. Apache hosts a Java and (soon) a C++ implementation of the UIMA
+ Framework.</para>
+
+ <para>The <emphasis role="bold">UIMA Software Development Kit (SDK)</emphasis>
+ includes the UIMA framework, plus tools and utilities for using UIMA. Some of the
+ tooling supports an Eclipse-based ( <ulink url="http://www.eclipse.org/"/>)
+ development environment. </para>
+
+ </section>
+
+ <section id="ugr.ovv.conceptual.analysis_basics">
+ <title>Analysis Basics</title>
+ <note><title>&key_concepts;</title><para>Analysis Engine, Document, Annotator, Annotator
+ Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA
+ Context.</para>
+ </note>
+
+ <section id="ugr.ovv.conceptual.aes_annotators_and_analysis_results">
+ <title>Analysis Engines, Annotators & Results</title>
+ <figure id="ugr.ovv.conceptual.metadata_in_cas">
+ <title>Objects represented in the Common Analysis Structure (CAS)</title>
+ <mediaobject>
+ <imageobject role="html">
+ <imagedata format="PNG" width="594px" align="center" fileref="&imgroot;image004.png"/>
+ </imageobject>
+ <imageobject role="fo">
+ <imagedata format="PNG" width="5.5in" align="center" fileref="&imgroot;image004.png"/>
+ </imageobject>
+ <textobject><phrase>Picture of some text, with a hierarchy of discovered
+ metadata about words in the text, including some image of a person as metadata
+ about that name.</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+ <para>UIMA is an architecture in which basic building blocks called Analysis Engines
+ (AEs) are composed to analyze a document and infer and record descriptive attributes
+ about the document as a whole, and/or about regions therein. This descriptive
+ information, produced by AEs is referred to generally as <emphasis role="bold">
+ analysis results</emphasis>. Analysis results typically represent meta-data
+ about the document content. One way to think about AEs is as software agents that
+ automatically discover and record meta-data about original content.</para>
+
+ <para>UIMA supports the analysis of different modalities including text, audio and
+ video. The majority of examples we provide are for text. We use the term <emphasis
+          role="bold">document</emphasis>, therefore, to generally refer to any unit of
+ content that an AE may process, whether it is a text document or a segment of audio, for
+ example. See the section <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.mvs"/> for more information on multimodal processing
+ in UIMA.</para>
+
+ <para>Analysis results include different statements about the content of a document.
+ For example, the following is an assertion about the topic of a document:</para>
+
+
+ <programlisting>(1) The Topic of document D102 is "CEOs and Golf".</programlisting>
+
+ <para>Analysis results may include statements describing regions more granular than
+ the entire document. We use the term <emphasis role="bold">span</emphasis> to
+ refer to a sequence of characters in a text document. Consider that a document with the
+            identifier D102 contains a span, <quote>Fred Centers</quote>, starting at
+ character position 101. An AE that can detect persons in text may represent the
+ following statement as an analysis result:</para>
+
+
+ <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
+
+ <para>In both statements 1 and 2 above there is a special pre-defined term or what we call
+ in UIMA a <emphasis role="bold">Type</emphasis>. They are
+ <emphasis>Topic</emphasis> and <emphasis>Person</emphasis> respectively.
+ UIMA types characterize the kinds of results that an AE may create – more on
+ types later.</para>
+
+ <para>Other analysis results may relate two statements. For example, an AE might
+ record in its results that two spans are both referring to the same person:</para>
+
+
+ <programlisting>(3) The Person denoted by span 101 to 112 and
+ the Person denoted by span 141 to 143 in document D102
+ refer to the same Entity.</programlisting>
+
+ <para>The above statements are some examples of the kinds of results that AEs may record
+ to describe the content of the documents they analyze. These are not meant to indicate
+ the form or syntax with which these results are captured in UIMA – more on that
+ later in this overview.</para>
+
+          <para>The UIMA framework treats Analysis Engines as pluggable, composable,
+            discoverable, managed objects. At the heart of AEs are the analysis algorithms that
+            do all the work to analyze documents and record analysis results. </para>
+
+ <para>UIMA provides a basic component type intended to house the core analysis
+ algorithms running inside AEs. Instances of this component are called <emphasis
+ role="bold">Annotators</emphasis>. The analysis algorithm developer's
+ primary concern therefore is the development of annotators. The UIMA framework
+ provides the necessary methods for taking annotators and creating analysis
+ engines.</para>
+
+ <para>In UIMA the person who codes analysis algorithms takes on the role of the
+ <emphasis role="bold">Annotator Developer</emphasis>. <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.aae"/> will take the reader
+ through the details involved in creating UIMA annotators and analysis
+ engines.</para>
+
+          <para>At the most primitive level an AE wraps an annotator, adding the necessary APIs and
+ infrastructure for the composition and deployment of annotators within the UIMA
+ framework. The simplest AE contains exactly one annotator at its core. Complex AEs
+ may contain a collection of other AEs each potentially containing within them other
+ AEs. </para>
+ </section>
+
+ <section id="ugr.ovv.conceptual.representing_results_in_cas">
+ <title>Representing Analysis Results in the CAS</title>
+
+ <para>How annotators represent and share their results is an important part of the UIMA
+ architecture. UIMA defines a <emphasis role="bold">Common Analysis Structure
+ (CAS)</emphasis> precisely for these purposes.</para>
+
+ <para>The CAS is an object-based data structure that allows the representation of
+ objects, properties and values. Object types may be related to each other in a
+ single-inheritance hierarchy. The CAS logically (if not physically) contains the
+ document being analyzed. Analysis developers share and record their analysis
+ results in terms of an object model within the CAS. <footnote><para> We have plans to
+ extend the representational capabilities of the CAS and align its semantics with the
+ semantics of the OMG's Essential Meta-Object Facility (EMOF) and with the
+            semantics of the Eclipse Modeling Framework's (<ulink
+            url="http://www.eclipse.org/emf/"/>) Ecore semantics and XMI-based
+ representation.</para> </footnote> </para>
+
+ <para>The UIMA framework includes an implementation and interfaces to the CAS. For a
+ more detailed description of the CAS and its interfaces see <olink
+ targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>
+
+ <para>A CAS that logically contains statement 2 (repeated here for your
+ convenience)</para>
+
+
+ <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
+
+ <para>would include objects of the Person type. For each person found in the body of a
+ document, the AE would create a Person object in the CAS and link it to the span of text
+ where the person was mentioned in the document.</para>
+
+ <para>While the CAS is a general purpose data structure, UIMA defines a
+ few basic types and affords the developer the ability to extend these to define an
+ arbitrarily rich <emphasis role="bold">Type System</emphasis>. You can think of a
+ type system as an object schema for the CAS.</para>
+
+ <para>A type system defines the various types of objects that may be discovered in
+            documents by AEs that subscribe to that type system.</para>
+
+ <para>As suggested above, Person may be defined as a type. Types have properties or
+ <emphasis role="bold">features</emphasis>. So for example,
+ <emphasis>Age</emphasis> and <emphasis>Occupation</emphasis> may be defined as
+ features of the Person type.</para>
+
+ <para>Other types might be <emphasis>Organization, Company, Bank, Facility, Money,
+ Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun
+ Phrase, Verb, Color, Parse Node, Feature Weight Array</emphasis> etc.</para>
+
+ <para>There are no limits to the different types that may be defined in a type system. A
+ type system is domain and application specific.</para>
+
+ <para>Types in a UIMA type system may be organized into a taxonomy. For example,
+ <emphasis>Company</emphasis> may be defined as a subtype of
+ <emphasis>Organization</emphasis>. <emphasis>NounPhrase</emphasis> may be a
+ subtype of a <emphasis>ParseNode</emphasis>.</para>
+
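+          <para>In a type system descriptor (an XML file discussed later in this chapter),
+            such a taxonomy is declared by naming each type's supertype. The following is a
+            minimal sketch only; the <code>example.*</code> names are invented here:</para>
+
+          <programlisting>&lt;typeDescription&gt;
+  &lt;name&gt;example.Company&lt;/name&gt;
+  &lt;description&gt;A company, a kind of organization.&lt;/description&gt;
+  &lt;supertypeName&gt;example.Organization&lt;/supertypeName&gt;
+&lt;/typeDescription&gt;</programlisting>
+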
+ <section id="ugr.ovv.conceptual.annotation_type">
+ <title>The Annotation Type</title>
+
+            <para>A general and common type used in artifact analysis, and one from which
+              additional types are often derived, is the <emphasis role="bold">annotation</emphasis>
+              type. </para>
+
+ <para>The annotation type is used to annotate or label regions of an artifact. Common
+ artifacts are text documents, but they can be other things, such as audio streams.
+ The annotation type for text includes two features, namely
+ <emphasis>begin</emphasis> and <emphasis>end</emphasis>. Values of these
+ features represent integer offsets in the artifact and delimit a span. Any
+ particular annotation object identifies the span it annotates with the
+ <emphasis>begin</emphasis> and <emphasis>end</emphasis> features.</para>
+
+ <para>The key idea here is that the annotation type is used to identify and label or
+ <quote>annotate</quote> a specific region of an artifact.</para>
+
+ <para>Consider that the Person type is defined as a subtype of annotation. An
+ annotator, for example, can create a Person annotation to record the discovery of a
+ mention of a person between position 141 and 143 in document D102. The annotator can
+ create another person annotation to record the detection of a mention of a person in
+ the span between positions 101 and 112. </para>
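+
+            <para>Using the JCas interface described later in this chapter, recording these
+              two discoveries might look like the following sketch in Java, assuming a
+              <code>Person</code> cover class has been generated from the type
+              system:</para>
+
+            <programlisting>// a sketch; Person is a generated JCas cover class
+Person p1 = new Person(jcas, 101, 112);
+p1.addToIndexes();
+Person p2 = new Person(jcas, 141, 143);
+p2.addToIndexes();</programlisting>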
+ </section>
+ <section id="ugr.ovv.conceptual.not_just_annotations">
+ <title>Not Just Annotations</title>
+
+ <para>While the annotation type is a useful type for annotating regions of a
+ document, annotations are not the only kind of types in a CAS. A CAS is a general
+ representation scheme and may store arbitrary data structures to represent the
+ analysis of documents.</para>
+
+ <para>As an example, consider statement 3 above (repeated here for your
+ convenience).</para>
+
+
+ <programlisting>(3) The Person denoted by span 101 to 112 and
+ the Person denoted by span 141 to 143 in document D102
+ refer to the same Entity.</programlisting>
+
+            <para>This statement mentions two person annotations in the CAS: the first, call it
+              P1, delimiting the span from 101 to 112, and the other, call it P2, delimiting the span
+              from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the
+ same entity. This means that while there are two expressions in the text
+ represented by the annotations P1 and P2, each refers to one and the same person.
+ </para>
+
+ <para>The Entity type may be introduced into a type system to capture this kind of
+ information. The Entity type is not an annotation. It is intended to represent an
+ object in the domain which may be referred to by different expressions (or
+ mentions) occurring multiple times within a document (or across documents within
+ a collection of documents). The Entity type has a feature named
+              <emphasis>occurrences</emphasis>. This feature is used to point to all the
+ annotations believed to label mentions of the same entity.</para>
+
+ <para>Consider that the spans annotated by P1 and P2 were <quote>Fred
+ Center</quote> and <quote>He</quote> respectively. The annotator might create
+ a new Entity object called
+ <code>FredCenter</code>. To represent the relationship in statement 3 above,
+ the annotator may link FredCenter to both P1 and P2 by making them values of its
+ <emphasis>occurrences</emphasis> feature.</para>
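+
+            <para>In Java, linking the entity to its mentions might be sketched as follows,
+              assuming generated <code>Entity</code> and <code>Person</code> cover classes
+              and an <emphasis>occurrences</emphasis> feature declared with range
+              <code>uima.cas.FSArray</code>:</para>
+
+            <programlisting>// a sketch; class and feature names follow the example above
+Entity fredCenter = new Entity(jcas);
+FSArray mentions = new FSArray(jcas, 2);
+mentions.set(0, p1);
+mentions.set(1, p2);
+fredCenter.setOccurrences(mentions);
+fredCenter.addToIndexes();</programlisting>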
+
+ <para> <xref linkend="ugr.ovv.conceptual.metadata_in_cas"/> also
+ illustrates that an entity may be linked to annotations referring to regions of
+ image documents as well. To do this the annotation type would have to be extended
+ with the appropriate features to point to regions of an image.</para>
+ </section>
+
+ <section id="ugr.ovv.conceptual.multiple_views_within_a_cas">
+ <title>Multiple Views within a CAS</title>
+
+ <para>UIMA supports the simultaneous analysis of multiple views of a document. This
+ support comes in handy for processing multiple forms of the artifact, for example, the audio
+ and the closed captioned views of a single speech stream, or the tagged and detagged
+ views of an HTML document.</para>
+
+ <para>AEs analyze one or more views of a document. Each view contains a specific
+              <emphasis role="bold">subject of analysis (Sofa)</emphasis>, plus a set of
+ indexes holding metadata indexed by that view. The CAS, overall, holds one or more
+ CAS Views, plus the descriptive objects that represent the analysis results for
+ each. </para>
+
+ <para>Another common example of using CAS Views is for different translations of a
+ document. Each translation may be represented with a different CAS View. Each
+ translation may be described by a different set of analysis results. For more
+ details on CAS Views and Sofas see <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.mvs"/> and <olink
+ targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>. </para>
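+
+              <para>In Java, an annotator obtains a particular view by name. A brief sketch,
+                with an invented view name:</para>
+
+              <programlisting>// a sketch; the view name "transcript" is illustrative
+JCas transcriptView = jcas.getView("transcript");
+String transcriptText = transcriptView.getDocumentText();</programlisting>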
+ </section>
+ </section>
+
+ <section id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources">
+ <title>Interacting with the CAS and External Resources</title>
+ <titleabbrev>Using CASes and External Resources</titleabbrev>
+
+ <para>The two main interfaces that a UIMA component developer interacts with are the
+ CAS and the UIMA Context.</para>
+
+ <para>UIMA provides an efficient implementation of the CAS with multiple programming
+ interfaces. Through these interfaces, the annotator developer interacts with the
+ document and reads and writes analysis results. The CAS interfaces provide a suite of
+ access methods that allow the developer to obtain indexed iterators to the different
+ objects in the CAS. See <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.cas"/>. While many objects may exist in a CAS, the annotator
+ developer can obtain a specialized iterator to all Person objects associated with a
+ particular view, for example.</para>
+
+ <para>For Java annotator developers, UIMA provides the JCas. This interface provides
+ the Java developer with a natural interface to CAS objects. Each type declared in the
+            type system appears as a Java class; the UIMA framework renders the Person type as a
+ Person class in Java. As the analysis algorithm detects mentions of persons in the
+ documents, it can create Person objects in the CAS. For more details on how to interact
+ with the CAS using this interface, refer to <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.jcas"/>.</para>
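+
+          <para>For example, iterating over all Person annotations with the JCas might be
+            sketched as follows, again assuming a generated <code>Person</code> cover
+            class:</para>
+
+          <programlisting>// a sketch; Person is a generated JCas cover class
+FSIterator it = jcas.getAnnotationIndex(Person.type).iterator();
+while (it.hasNext()) {
+  Person p = (Person) it.next();
+  // the text span this annotation labels
+  String span = p.getCoveredText();
+}</programlisting>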
+
+ <para>The component developer, in addition to interacting with the CAS, can access
+ external resources through the framework's resource manager interface
+            called the <emphasis role="bold">UIMA Context</emphasis>. This interface can,
+            among other things, ensure that different annotators working together in an aggregate
+            flow share the same instance of an external file, for example. For details on using
+ the UIMA Context see <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.aae"/>.</para>
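+
+          <para>For instance, an annotator can read a configuration parameter through its
+            UIMA Context. A brief sketch, with an invented parameter name:</para>
+
+          <programlisting>// a sketch, inside an annotator; "Language" is an
+// invented configuration parameter name
+String language = (String) getContext().getConfigParameterValue("Language");</programlisting>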
+
+ </section>
+ <section id="ugr.ovv.conceptual.component_descriptors">
+ <title>Component Descriptors</title>
+          <para>UIMA defines interfaces for a small set of core components that users of the
+            framework provide implementations for. Annotators and Analysis Engines are two of
+ the basic building blocks specified by the architecture. Developers implement them
+ to build and compose analysis capabilities and ultimately applications.</para>
+
+          <para>There are other components in addition to these, which we will learn about
+            later, but every component specified in UIMA requires two parts for its
+            implementation:</para>
+
+ <orderedlist spacing="compact">
+ <listitem><para>the declarative part and</para></listitem>
+ <listitem><para>the code part.</para></listitem>
+ </orderedlist>
+
+ <para>The declarative part contains metadata describing the component, its
+ identity, structure and behavior and is called the <emphasis role="bold">
+ Component Descriptor</emphasis>. Component descriptors are represented in XML.
+ The code part implements the algorithm. The code part may be a program in Java.</para>
+
+          <para>As a developer using the UIMA SDK, implementing a UIMA component always
+            involves providing these two parts: the code part and the Component Descriptor.
+            Note that when you are composing an aggregate engine, the code may already be
+            provided in reusable subcomponents. In these cases you are not developing new
+            code but rather composing an aggregate engine by pointing to other components
+            where the code has been included.</para>
+
+          <para>Component descriptors aid in component discovery, reuse, composition and
+            development tooling. The UIMA SDK provides a tool for easily creating and
+            maintaining component descriptors that relieves the developer from editing XML
+            directly. This tool is described briefly in <olink
+            targetdoc="&uima_docs_tutorial_guides;"
+            targetptr="ugr.tug.aae"/>, and more
+            thoroughly in <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>.</para>
+
+ <para>Component descriptors contain standard metadata including the
+ component's name, author, version, and a reference to the class that
+ implements the component.</para>
+
+ <para>In addition to these standard fields, a component descriptor identifies the
+ type system the component uses and the types it requires in an input CAS and the types it
+ plans to produce in an output CAS.</para>
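+
+          <para>In the descriptor, these inputs and outputs are declared in a capabilities
+            element, along the lines of the following sketch (the <code>example.*</code>
+            type names are invented):</para>
+
+          <programlisting>&lt;capabilities&gt;
+  &lt;capability&gt;
+    &lt;inputs&gt;
+      &lt;type&gt;example.Token&lt;/type&gt;
+      &lt;type&gt;example.ParseNode&lt;/type&gt;
+    &lt;/inputs&gt;
+    &lt;outputs&gt;
+      &lt;type&gt;example.Person&lt;/type&gt;
+    &lt;/outputs&gt;
+  &lt;/capability&gt;
+&lt;/capabilities&gt;</programlisting>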
+
+ <para>For example, an AE that detects person types may require as input a CAS that
+ includes a tokenization and deep parse of the document. The descriptor refers to a
+ type system to make the component's input requirements and output types
+ explicit. In effect, the descriptor includes a declarative description of the
+ component's behavior and can be used to aid in component discovery and
+ composition based on desired results. UIMA analysis engines provide an interface
+ for accessing the component metadata represented in their descriptors. For more
[... 484 lines stripped ...]