Posted to commits@uima.apache.org by sc...@apache.org on 2008/08/28 23:28:16 UTC
svn commit: r689997 [2/32] - in /incubator/uima/uimaj/trunk/uima-docbooks:
./ src/ src/docbook/overview_and_setup/ src/docbook/references/
src/docbook/tools/ src/docbook/tutorials_and_users_guides/
src/docbook/uima/organization/ src/olink/references/
Modified: incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/overview_and_setup/conceptual_overview.xml
URL: http://svn.apache.org/viewvc/incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/overview_and_setup/conceptual_overview.xml?rev=689997&r1=689996&r2=689997&view=diff
==============================================================================
--- incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/overview_and_setup/conceptual_overview.xml (original)
+++ incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/overview_and_setup/conceptual_overview.xml Thu Aug 28 14:28:14 2008
@@ -1,978 +1,978 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
-"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
-<!ENTITY key_concepts "Key UIMA Concepts Introduced in this Section:">
-<!ENTITY imgroot "../images/overview_and_setup/conceptual_overview_files/" >
-<!ENTITY % uimaents SYSTEM "../entities.ent" >
-%uimaents;
-]>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<chapter id="ugr.ovv.conceptual">
- <title>UIMA Conceptual Overview</title>
-
- <para>UIMA is an open, industrial-strength, scalable and extensible platform for
- creating, integrating and deploying unstructured information management solutions
- from powerful text or multi-modal analysis and search components. </para>
-
- <para>The Apache UIMA project is a Java implementation of the UIMA framework,
- available under the Apache License. It provides a common foundation for industry and
- academia to collaborate on and accelerate the world-wide development of technologies
- critical for discovering vital knowledge present in the fastest growing sources of
- information today.</para>
-
- <para>This chapter presents an introduction to many essential UIMA concepts. It is meant to
- provide a broad overview to give the reader a quick sense of UIMA's basic
- architectural philosophy and the UIMA SDK's capabilities. </para>
-
- <para>This chapter provides a general orientation to UIMA and makes liberal reference to
- the other chapters in the UIMA SDK documentation set, where the reader may find detailed
- treatments of key concepts and development practices. It may be useful to refer to <olink
- targetdoc="&uima_docs_overview;" targetptr="ugr.glossary"/>, to become familiar
- with the terminology in this overview.</para>
-
- <section id="ugr.ovv.conceptual.uima_introduction">
- <title>UIMA Introduction</title>
- <figure id="ugr.ovv.conceptual.fig.bridge">
- <title>UIMA helps you build the bridge between the unstructured and structured
- worlds</title>
- <mediaobject>
- <imageobject>
- <imagedata width="5.5in" format="PNG" fileref="&imgroot;image002.png"/>
- </imageobject>
- <textobject><phrase>Picture of a bridge between unstructured information
- artifacts and structured metadata about those artifacts</phrase>
- </textobject>
- </mediaobject>
- </figure>
-
- <para> Unstructured information represents the largest, most current and fastest
- growing source of information available to businesses and governments. The web is just
- the tip of the iceberg. Consider the mounds of information hosted in the enterprise and
- around the world and across different media including text, voice and video. The
- high-value content in these vast collections of unstructured information is,
- unfortunately, buried in lots of noise. Searching for what you need or doing
- sophisticated data mining over unstructured information sources presents new
- challenges. </para>
-
- <para>An unstructured information management (UIM) application may be generally
- characterized as a software system that analyzes large volumes of unstructured
- information (text, audio, video, images, etc.) to discover, organize and deliver
- relevant knowledge to the client or application end-user. An example is an application
- that processes millions of medical abstracts to discover critical drug interactions.
- Another example is an application that processes tens of millions of documents to
- discover key evidence indicating probable competitive threats. </para>
-
- <para>First and foremost, the unstructured data must be analyzed to interpret, detect
- and locate concepts of interest, for example, named entities like persons,
- organizations, locations, facilities, products etc., that are not explicitly tagged
- or annotated in the original artifact. More challenging analytics may detect things
- like opinions, complaints, threats or facts. And then there are relations, for
- example, located in, finances, supports, purchases, repairs etc. The list of concepts
- important for applications to discover in unstructured content is large, varied and
- often domain specific.
- Many different component analytics may solve different parts of the overall analysis task.
- These component analytics must interoperate and must be easily combined to facilitate
- the development of UIM applications.</para>
-
- <para>The results of analysis are used to populate structured forms so that conventional
- data processing and search technologies
- like search engines, database engines or OLAP
- (On-Line Analytical Processing, or Data Mining) engines
- can efficiently deliver the newly discovered content in response to client requests
- or queries.</para>
-
- <para>In analyzing unstructured content, UIM applications make use of a variety of
- analysis technologies including:</para>
-
- <itemizedlist spacing="compact">
- <listitem><para>Statistical and rule-based Natural Language Processing
- (NLP)</para>
- </listitem>
- <listitem><para>Information Retrieval (IR)</para>
- </listitem>
- <listitem><para>Machine learning</para>
- </listitem>
- <listitem><para>Ontologies</para>
- </listitem>
- <listitem><para>Automated reasoning, and</para>
- </listitem>
- <listitem><para>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</para>
- </listitem>
-
- </itemizedlist>
-
- <para>Specific analysis capabilities using these technologies are developed
- independently using different techniques, interfaces and platforms.
- </para>
-
- <para>The bridge from the unstructured world to the structured world is built through the
- composition and deployment of these analysis capabilities. This integration is often
- a costly challenge. </para>
-
- <para>The Unstructured Information Management Architecture (UIMA) is an architecture
- and software framework that helps you build that bridge. It supports creating,
- discovering, composing and deploying a broad range of analysis capabilities and
- linking them to structured information services.</para>
-
- <para>UIMA allows development teams to match the right skills with the right parts of a
- solution and helps enable rapid integration across technologies and platforms using a
- variety of different deployment options. These range from tightly-coupled
- deployments for high-performance, single-machine, embedded solutions to parallel
- and fully distributed deployments for highly flexible and scalable
- solutions.</para>
-
- </section>
-
- <section id="ugr.ovv.conceptual.architecture_framework_sdk">
- <title>The Architecture, the Framework and the SDK</title>
- <para>UIMA is a software architecture which specifies component interfaces, data
- representations, design patterns and development roles for creating, describing,
- discovering, composing and deploying multi-modal analysis capabilities.</para>
-
- <para>The <emphasis role="bold">UIMA framework</emphasis> provides a run-time
- environment in which developers can plug in their UIMA component implementations and
- with which they can build and deploy UIM applications. The framework is not specific to
- any IDE or platform. Apache hosts a Java and (soon) a C++ implementation of the UIMA
- Framework.</para>
-
- <para>The <emphasis role="bold">UIMA Software Development Kit (SDK)</emphasis>
- includes the UIMA framework, plus tools and utilities for using UIMA. Some of the
- tooling supports an Eclipse-based ( <ulink url="http://www.eclipse.org/"/>)
- development environment. </para>
-
- </section>
-
- <section id="ugr.ovv.conceptual.analysis_basics">
- <title>Analysis Basics</title>
- <note><title>&key_concepts;</title><para>Analysis Engine, Document, Annotator, Annotator
- Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA
- Context.</para>
- </note>
-
- <section id="ugr.ovv.conceptual.aes_annotators_and_analysis_results">
- <title>Analysis Engines, Annotators & Results</title>
- <figure id="ugr.ovv.conceptual.metadata_in_cas">
- <title>Objects represented in the Common Analysis Structure (CAS)</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata format="PNG" width="594px" align="center" fileref="&imgroot;image004.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata format="PNG" width="5.5in" align="center" fileref="&imgroot;image004.png"/>
- </imageobject>
- <textobject><phrase>Picture of some text, with a hierarchy of discovered
- metadata about words in the text, including some image of a person as metadata
- about that name.</phrase>
- </textobject>
- </mediaobject>
- </figure>
-
- <para>UIMA is an architecture in which basic building blocks called Analysis Engines
- (AEs) are composed to analyze a document and infer and record descriptive attributes
- about the document as a whole, and/or about regions therein. This descriptive
- information, produced by AEs is referred to generally as <emphasis role="bold">
- analysis results</emphasis>. Analysis results typically represent meta-data
- about the document content. One way to think about AEs is as software agents that
- automatically discover and record meta-data about original content.</para>
-
- <para>UIMA supports the analysis of different modalities including text, audio and
- video. The majority of examples we provide are for text. We use the term <emphasis
- role="bold">document</emphasis>, therefore, to generally refer to any unit of
- content that an AE may process, whether it is a text document or a segment of audio, for
- example. See the section <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.mvs"/> for more information on multimodal processing
- in UIMA.</para>
-
- <para>Analysis results include different statements about the content of a document.
- For example, the following is an assertion about the topic of a document:</para>
-
-
- <programlisting>(1) The Topic of document D102 is "CEOs and Golf".</programlisting>
-
- <para>Analysis results may include statements describing regions more granular than
- the entire document. We use the term <emphasis role="bold">span</emphasis> to
- refer to a sequence of characters in a text document. Consider that a document with the
- identifier D102 contains a span, <quote>Fred Centers</quote> starting at
- character position 101. An AE that can detect persons in text may represent the
- following statement as an analysis result:</para>
-
-
- <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
-
- <para>In both statements 1 and 2 above there is a special pre-defined term, or what we
- call in UIMA a <emphasis role="bold">Type</emphasis>. They are
- <emphasis>Topic</emphasis> and <emphasis>Person</emphasis> respectively.
- UIMA types characterize the kinds of results that an AE may create – more on
- types later.</para>
-
- <para>Other analysis results may relate two statements. For example, an AE might
- record in its results that two spans are both referring to the same person:</para>
-
-
- <programlisting>(3) The Person denoted by span 101 to 112 and
- the Person denoted by span 141 to 143 in document D102
- refer to the same Entity.</programlisting>
-
- <para>The above statements are some examples of the kinds of results that AEs may record
- to describe the content of the documents they analyze. These are not meant to indicate
- the form or syntax with which these results are captured in UIMA – more on that
- later in this overview.</para>
-
- <para>The UIMA framework treats Analysis Engines as pluggable, composable,
- discoverable, managed objects. At the heart of AEs are the analysis algorithms that
- do all the work to analyze documents and record analysis results. </para>
-
- <para>UIMA provides a basic component type intended to house the core analysis
- algorithms running inside AEs. Instances of this component are called <emphasis
- role="bold">Annotators</emphasis>. The analysis algorithm developer's
- primary concern therefore is the development of annotators. The UIMA framework
- provides the necessary methods for taking annotators and creating analysis
- engines.</para>
-
- <para>In UIMA the person who codes analysis algorithms takes on the role of the
- <emphasis role="bold">Annotator Developer</emphasis>. <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.aae"/> will take the reader
- through the details involved in creating UIMA annotators and analysis
- engines.</para>
-
- <para>At the most primitive level an AE wraps an annotator adding the necessary APIs and
- infrastructure for the composition and deployment of annotators within the UIMA
- framework. The simplest AE contains exactly one annotator at its core. Complex AEs
- may contain a collection of other AEs each potentially containing within them other
- AEs. </para>
- </section>
-
- <section id="ugr.ovv.conceptual.representing_results_in_cas">
- <title>Representing Analysis Results in the CAS</title>
-
- <para>How annotators represent and share their results is an important part of the UIMA
- architecture. UIMA defines a <emphasis role="bold">Common Analysis Structure
- (CAS)</emphasis> precisely for these purposes.</para>
-
- <para>The CAS is an object-based data structure that allows the representation of
- objects, properties and values. Object types may be related to each other in a
- single-inheritance hierarchy. The CAS logically (if not physically) contains the
- document being analyzed. Analysis developers share and record their analysis
- results in terms of an object model within the CAS. <footnote><para> We have plans to
- extend the representational capabilities of the CAS and align its semantics with the
- semantics of the OMG's Essential Meta-Object Facility (EMOF) and with the
- semantics of the Eclipse Modeling Framework's ( <ulink
- url="http://www.eclipse.org/emf/"/>) Ecore semantics and XMI-based
- representation.</para> </footnote> </para>
-
- <para>The UIMA framework includes an implementation and interfaces to the CAS. For a
- more detailed description of the CAS and its interfaces see <olink
- targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>
-
- <para>A CAS that logically contains statement 2 (repeated here for your
- convenience)</para>
-
-
- <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
-
- <para>would include objects of the Person type. For each person found in the body of a
- document, the AE would create a Person object in the CAS and link it to the span of text
- where the person was mentioned in the document.</para>
-
- <para>While the CAS is a general purpose data structure, UIMA defines a
- few basic types and affords the developer the ability to extend these to define an
- arbitrarily rich <emphasis role="bold">Type System</emphasis>. You can think of a
- type system as an object schema for the CAS.</para>
-
- <para>A type system defines the various types of objects that may be discovered in
- documents by AEs that subscribe to that type system.</para>
-
- <para>As suggested above, Person may be defined as a type. Types have properties or
- <emphasis role="bold">features</emphasis>. So for example,
- <emphasis>Age</emphasis> and <emphasis>Occupation</emphasis> may be defined as
- features of the Person type.</para>
-
- <para>Other types might be <emphasis>Organization, Company, Bank, Facility, Money,
- Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun
- Phrase, Verb, Color, Parse Node, Feature Weight Array</emphasis> etc.</para>
-
- <para>There are no limits to the different types that may be defined in a type system. A
- type system is domain and application specific.</para>
-
- <para>Types in a UIMA type system may be organized into a taxonomy. For example,
- <emphasis>Company</emphasis> may be defined as a subtype of
- <emphasis>Organization</emphasis>. <emphasis>NounPhrase</emphasis> may be a
- subtype of a <emphasis>ParseNode</emphasis>.</para>
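The single-inheritance taxonomy described above can be sketched in plain Java. The type names (Organization, Company, ParseNode, NounPhrase) come from the text, but the classes themselves are only illustrative; a real UIMA type system is declared in a descriptor, not as Java classes.

```java
// Illustrative single-inheritance taxonomy mirroring a UIMA type system.
// Plain Java classes for exposition only -- not the UIMA SDK API.
class Organization { }
class Company extends Organization { }   // Company is a subtype of Organization

class ParseNode { }
class NounPhrase extends ParseNode { }   // NounPhrase is a subtype of ParseNode

public class TaxonomySketch {
    // A Company instance is also an Organization, as in the type hierarchy.
    public static boolean isOrganization(Object o) {
        return o instanceof Organization;
    }

    public static void main(String[] args) {
        System.out.println(isOrganization(new Company()));   // true
        System.out.println(isOrganization(new NounPhrase())); // false
    }
}
```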
-
- <section id="ugr.ovv.conceptual.annotation_type">
- <title>The Annotation Type</title>
-
- <para>A general and common type used in artifact analysis and from which additional
- types are often derived is the <emphasis role="bold">annotation</emphasis>
- type. </para>
-
- <para>The annotation type is used to annotate or label regions of an artifact. Common
- artifacts are text documents, but they can be other things, such as audio streams.
- The annotation type for text includes two features, namely
- <emphasis>begin</emphasis> and <emphasis>end</emphasis>. Values of these
- features represent integer offsets in the artifact and delimit a span. Any
- particular annotation object identifies the span it annotates with the
- <emphasis>begin</emphasis> and <emphasis>end</emphasis> features.</para>
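The begin/end idea can be sketched as a minimal, self-contained Java class; this models the concept of an annotation delimiting a span (the document text and offsets below are invented for illustration and are not the UIMA Annotation API).

```java
// Minimal illustration of the annotation idea: begin and end offsets that
// delimit a span of a text artifact. Models the concept; not the UIMA API.
public class AnnotationSketch {
    final int begin;   // offset of the first character of the span
    final int end;     // offset just past the last character of the span

    AnnotationSketch(int begin, int end) {
        this.begin = begin;
        this.end = end;
    }

    // Recover the annotated region from the document text.
    String coveredText(String document) {
        return document.substring(begin, end);
    }

    public static void main(String[] args) {
        String doc = "A meeting with Fred Centers was held.";
        AnnotationSketch person = new AnnotationSketch(15, 27);
        System.out.println(person.coveredText(doc));  // Fred Centers
    }
}
```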
-
- <para>The key idea here is that the annotation type is used to identify and label or
- <quote>annotate</quote> a specific region of an artifact.</para>
-
- <para>Consider that the Person type is defined as a subtype of annotation. An
- annotator, for example, can create a Person annotation to record the discovery of a
- mention of a person between position 141 and 143 in document D102. The annotator can
- create another person annotation to record the detection of a mention of a person in
- the span between positions 101 and 112. </para>
- </section>
- <section id="ugr.ovv.conceptual.not_just_annotations">
- <title>Not Just Annotations</title>
-
- <para>While the annotation type is a useful type for annotating regions of a
- document, annotations are not the only kind of types in a CAS. A CAS is a general
- representation scheme and may store arbitrary data structures to represent the
- analysis of documents.</para>
-
- <para>As an example, consider statement 3 above (repeated here for your
- convenience).</para>
-
-
- <programlisting>(3) The Person denoted by span 101 to 112 and
- the Person denoted by span 141 to 143 in document D102
- refer to the same Entity.</programlisting>
-
- <para>This statement mentions two person annotations in the CAS; the first, call it
- P1, delimiting the span from 101 to 112, and the other, call it P2, delimiting the span
- from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the
- same entity. This means that while there are two expressions in the text
- represented by the annotations P1 and P2, each refers to one and the same person.
- </para>
-
- <para>The Entity type may be introduced into a type system to capture this kind of
- information. The Entity type is not an annotation. It is intended to represent an
- object in the domain which may be referred to by different expressions (or
- mentions) occurring multiple times within a document (or across documents within
- a collection of documents). The Entity type has a feature named
- <emphasis>occurrences</emphasis>. This feature is used to point to all the
- annotations believed to label mentions of the same entity.</para>
-
- <para>Consider that the spans annotated by P1 and P2 were <quote>Fred
- Center</quote> and <quote>He</quote> respectively. The annotator might create
- a new Entity object called
- <code>FredCenter</code>. To represent the relationship in statement 3 above,
- the annotator may link FredCenter to both P1 and P2 by making them values of its
- <emphasis>occurrences</emphasis> feature.</para>
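The Entity-with-occurrences idea can be sketched as follows. This is a plain-Java model of statement (3), assuming spans are represented as begin/end pairs; in UIMA the Entity and its occurrences feature would be CAS feature structures, not these classes.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java model of statement (3): an Entity (not an annotation) whose
// "occurrences" feature points at the annotations believed to mention it.
// Illustrative only; not the UIMA CAS representation.
public class EntitySketch {
    final String name;
    final List<int[]> occurrences = new ArrayList<>(); // spans as {begin, end}

    EntitySketch(String name) { this.name = name; }

    void addOccurrence(int begin, int end) {
        occurrences.add(new int[] { begin, end });
    }

    public static void main(String[] args) {
        // Two mentions ("Fred Center" and "He") of one and the same person.
        EntitySketch fredCenter = new EntitySketch("FredCenter");
        fredCenter.addOccurrence(101, 112); // P1
        fredCenter.addOccurrence(141, 143); // P2
        System.out.println(fredCenter.occurrences.size()); // 2
    }
}
```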
-
- <para> <xref linkend="ugr.ovv.conceptual.metadata_in_cas"/> also
- illustrates that an entity may be linked to annotations referring to regions of
- image documents as well. To do this the annotation type would have to be extended
- with the appropriate features to point to regions of an image.</para>
- </section>
-
- <section id="ugr.ovv.conceptual.multiple_views_within_a_cas">
- <title>Multiple Views within a CAS</title>
-
- <para>UIMA supports the simultaneous analysis of multiple views of a document. This
- support comes in handy for processing multiple forms of the artifact, for example, the audio
- and the closed captioned views of a single speech stream, or the tagged and detagged
- views of an HTML document.</para>
-
- <para>AEs analyze one or more views of a document. Each view contains a specific
- <emphasis role="bold">subject of analysis (Sofa)</emphasis>, plus a set of
- indexes holding metadata indexed by that view. The CAS, overall, holds one or more
- CAS Views, plus the descriptive objects that represent the analysis results for
- each. </para>
-
- <para>Another common example of using CAS Views is for different translations of a
- document. Each translation may be represented with a different CAS View. Each
- translation may be described by a different set of analysis results. For more
- details on CAS Views and Sofas see <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.mvs"/> and <olink
- targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>. </para>
- </section>
- </section>
-
- <section id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources">
- <title>Interacting with the CAS and External Resources</title>
- <titleabbrev>Using CASes and External Resources</titleabbrev>
-
- <para>The two main interfaces that a UIMA component developer interacts with are the
- CAS and the UIMA Context.</para>
-
- <para>UIMA provides an efficient implementation of the CAS with multiple programming
- interfaces. Through these interfaces, the annotator developer interacts with the
- document and reads and writes analysis results. The CAS interfaces provide a suite of
- access methods that allow the developer to obtain indexed iterators to the different
- objects in the CAS. See <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.cas"/>. While many objects may exist in a CAS, the annotator
- developer can obtain a specialized iterator to all Person objects associated with a
- particular view, for example.</para>
-
- <para>For Java annotator developers, UIMA provides the JCas. This interface provides
- the Java developer with a natural interface to CAS objects. Each type declared in the
- type system appears as a Java Class; the UIMA framework renders the Person type as a
- Person class in Java. As the analysis algorithm detects mentions of persons in the
- documents, it can create Person objects in the CAS. For more details on how to interact
- with the CAS using this interface, refer to <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.jcas"/>.</para>
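The idea of obtaining a specialized iterator over just the Person objects can be sketched in plain Java. This is only an analogy for the concept: the nested Person and Token classes and the filtering helper are invented for illustration, not the JCas or AnnotationIndex API.

```java
import java.util.Iterator;
import java.util.List;

// Sketch of the idea behind typed CAS iterators: from a heterogeneous set of
// analysis results, obtain an iterator over only the Person objects.
// A plain-Java analogy, not the JCas/AnnotationIndex API.
public class TypedIteratorSketch {
    static class Person { final String name; Person(String n) { name = n; } }
    static class Token  { }

    // Analogous to asking the CAS for an iterator specialized to one type.
    static Iterator<Person> personIterator(List<Object> results) {
        return results.stream()
                      .filter(r -> r instanceof Person)
                      .map(r -> (Person) r)
                      .iterator();
    }

    public static void main(String[] args) {
        List<Object> cas = List.of(new Token(), new Person("Fred Centers"), new Token());
        Iterator<Person> it = personIterator(cas);
        while (it.hasNext()) {
            System.out.println(it.next().name); // Fred Centers
        }
    }
}
```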
-
- <para>The component developer, in addition to interacting with the CAS, can access
- external resources through the framework's resource manager interface
- called the <emphasis role="bold">UIMA Context</emphasis>. This interface, among
- other things, can ensure that different annotators working together in an aggregate
- flow may share the same instance of an external file, for example. For details on using
- the UIMA Context see <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.aae"/>.</para>
-
- </section>
- <section id="ugr.ovv.conceptual.component_descriptors">
- <title>Component Descriptors</title>
- <para>UIMA defines interfaces for a small set of core components that users of the
- framework provide implementations for. Annotators and Analysis Engines are two of
- the basic building blocks specified by the architecture. Developers implement them
- to build and compose analysis capabilities and ultimately applications.</para>
-
- <para>There are other components in addition to these, which we will learn about
- later, but for every component specified in UIMA there are two parts required for its
- implementation:</para>
-
- <orderedlist spacing="compact">
- <listitem><para>the declarative part and</para></listitem>
- <listitem><para>the code part.</para></listitem>
- </orderedlist>
-
- <para>The declarative part contains metadata describing the component, its
- identity, structure and behavior and is called the <emphasis role="bold">
- Component Descriptor</emphasis>. Component descriptors are represented in XML.
- The code part implements the algorithm. The code part may be a program in Java.</para>
-
- <para>As a developer using the UIMA SDK, to implement a UIMA component you always
- provide two things: the code part and the Component Descriptor.
- Note that when you are composing an engine, the code may be already provided in
- reusable subcomponents. In these cases you may not be developing new code but rather
- composing an aggregate engine by pointing to other components where the code has been
- included.</para>
-
- <para>Component descriptors are represented in XML and aid in component discovery,
- reuse, composition and development tooling. The UIMA SDK provides tools for easily
- creating and maintaining component descriptors, relieving the developer from
- editing XML directly. This tooling is described briefly in <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.aae"/>, and more
- thoroughly in <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>
- .</para>
-
- <para>Component descriptors contain standard metadata including the
- component's name, author, version, and a reference to the class that
- implements the component.</para>
-
- <para>In addition to these standard fields, a component descriptor identifies the
- type system the component uses, the types it requires in an input CAS, and the types it
- plans to produce in an output CAS.</para>
-
- <para>For example, an AE that detects person types may require as input a CAS that
- includes a tokenization and deep parse of the document. The descriptor refers to a
- type system to make the component's input requirements and output types
- explicit. In effect, the descriptor includes a declarative description of the
- component's behavior and can be used to aid in component discovery and
- composition based on desired results. UIMA analysis engines provide an interface
- for accessing the component metadata represented in their descriptors. For more
- details on the structure of UIMA component descriptors refer to <olink
- targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.component_descriptor"/>.</para>
-
- </section>
- </section>
- <section id="ugr.ovv.conceptual.aggregate_analysis_engines">
- <title>Aggregate Analysis Engines</title>
-
- <note><title>&key_concepts;</title><para>Aggregate Analysis Engine, Delegate Analysis Engine,
- Tightly and Loosely Coupled, Flow Specification, Analysis Engine Assembler</para>
- </note>
-
- <figure id="ugr.ovv.conceptual.sample_aggregate">
- <title>Sample Aggregate Analysis Engine</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="588px" format="PNG" fileref="&imgroot;image006.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" format="PNG" fileref="&imgroot;image006.png"/>
- </imageobject>
- <textobject><phrase>Picture of multiple parts (a language identifier,
- tokenizer, part of speech annotator, shallow parser, and named entity detector)
- strung together into a flow, and all of them wrapped as a single aggregate object,
- which produces as annotations the union of all the results of the individual
- annotator components ( tokens, parts of speech, names, organizations, places,
- persons, etc.)</phrase>
- </textobject>
- </mediaobject>
- </figure>
-
- <para>A simple or primitive UIMA Analysis Engine (AE) contains a single annotator. AEs,
- however, may be defined to contain other AEs organized in a workflow. These more complex
- analysis engines are called <emphasis role="bold">Aggregate Analysis
- Engines.</emphasis> </para>
-
- <para>Annotators tend to perform fairly granular functions, for example language
- detection, tokenization or part of speech detection.
- These functions typically address just part of an overall analysis task. A workflow
- of component engines may be orchestrated to perform more complex tasks.</para>
-
- <para>An AE that performs named entity detection, for example, may
- include a pipeline of annotators starting with language detection feeding
- tokenization, then part-of-speech detection, then deep grammatical parsing and then
- finally named-entity detection. Each step in the pipeline is required by the
- subsequent analysis. For example, the final named-entity annotator can only do its
- analysis if the previous deep grammatical parse was recorded in the CAS.</para>
-
- <para>Aggregate AEs are built to encapsulate potentially complex internal structure
- and insulate it from users of the AE. In our example, the aggregate analysis engine
- developer acquires the internal components, defines the necessary flow
- between them and publishes the resulting AE. Consider the simple example illustrated
- in <xref linkend="ugr.ovv.conceptual.sample_aggregate"/> where
- <quote>MyNamed-EntityDetector</quote> is composed of a linear flow of more
- primitive analysis engines.</para>
-
- <para>Users of this AE need not know how it is constructed internally but only need its name
- and its published input requirements and output types. These must be declared in the
- aggregate AE's descriptor. Aggregate AE descriptors declare the components
- they contain and a <emphasis role="bold">flow specification</emphasis>. The flow
- specification defines the order in which the internal component AEs should be run. The
- internal AEs specified in an aggregate are also called the <emphasis role="bold">
- delegate analysis engines.</emphasis> The term "delegate" is used because aggregate AEs
- are thought to "delegate" functions to their internal AEs.</para>
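The delegate flow can be sketched with a tiny plain-Java pipeline, modeling the CAS as a shared map that each delegate enriches in the declared order. The delegate names and map keys below are invented for illustration; in UIMA the framework itself manages real delegate AEs and CAS access.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch of an aggregate AE's linear flow: each delegate "analysis engine"
// receives the shared CAS (modeled here as a map) in the declared order and
// records its results. Illustrative only; not the UIMA framework API.
public class LinearFlowSketch {
    interface Delegate { void process(Map<String, Object> cas); }

    static void runFlow(List<Delegate> flow, Map<String, Object> cas) {
        for (Delegate d : flow) {
            d.process(cas);   // each delegate sees the results of earlier ones
        }
    }

    public static void main(String[] args) {
        Map<String, Object> cas = new LinkedHashMap<>();
        cas.put("text", "Fred Centers visited Armonk.");

        List<Delegate> flow = new ArrayList<>();
        flow.add(c -> c.put("language", "en"));                                  // language id
        flow.add(c -> c.put("tokens", ((String) c.get("text")).split(" ").length)); // tokenizer
        flow.add(c -> c.put("namedEntities", List.of("Fred Centers", "Armonk"))); // NE detector

        runFlow(flow, cas);
        System.out.println(cas.keySet()); // [text, language, tokens, namedEntities]
    }
}
```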
-
- <para>
- In UIMA 2.0, the developer can implement a "Flow Controller" and include it as part
- of an aggregate AE by referring to it in the aggregate AE's descriptor.
- The flow controller is responsible for computing the "flow", that is,
- for determining the order in which the delegate AEs will process the CAS.
- The Flow Controller has access to the CAS and any external resources it may require
- for determining the flow. It can do this dynamically at run time, it can
- make multi-step decisions and it can consider any sort of flow specification
- included in the aggregate AE's descriptor. See
- <olink targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.fc"/>
- for details on the UIMA Flow Controller interface.
- </para>
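The idea can be sketched with a toy model in plain Java. The class and delegate names here are invented, and this is not the real org.apache.uima.flow interface; it only illustrates a flow computed dynamically from what the CAS already contains.

```java
import java.util.*;

// Toy model of a flow controller (plain Java; the real interfaces
// live in org.apache.uima.flow and are richer than this sketch).
public class FlowDemo {

    // Stand-in for a CAS: just the set of type names already recorded in it.
    static class ToyCas {
        final Set<String> typesPresent = new HashSet<>();
    }

    // Decide the delegate order dynamically, based on CAS contents.
    static List<String> computeFlow(ToyCas cas) {
        List<String> flow = new ArrayList<>(List.of("Tokenizer", "NameDetector"));
        // Data-driven, multi-step decision: run coreference only if an
        // earlier deep parse was recorded in the CAS.
        if (cas.typesPresent.contains("DeepParse")) {
            flow.add("CoreferenceResolver");
        }
        return flow;
    }

    public static void main(String[] args) {
        ToyCas cas = new ToyCas();
        System.out.println(computeFlow(cas));     // no deep parse recorded yet
        cas.typesPresent.add("DeepParse");
        System.out.println(computeFlow(cas));     // now includes coreference
    }
}
```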
-
-    <para>We refer to the development role associated with building an aggregate from
-      delegate AEs as the <emphasis role="bold">Analysis Engine
-      Assembler</emphasis>.</para>
-
- <para>The UIMA framework, given an aggregate analysis engine descriptor, will run all
- delegate AEs, ensuring that each one gets access to the CAS in the sequence produced by
- the flow controller. The UIMA framework is equipped to handle different
- deployments where the delegate engines, for example, are <emphasis role="bold">
- tightly-coupled</emphasis> (running in the same process) or <emphasis role="bold">
- loosely-coupled</emphasis> (running in separate processes or even on different
- machines). The framework supports a number of remote protocols for loose coupling
- deployments of aggregate analysis engines, including SOAP (which stands for Simple
- Object Access Protocol, a standard Web Services communications protocol).</para>
-
- <para>The UIMA framework facilitates the deployment of AEs as remote services by using an
- adapter layer that automatically creates the necessary infrastructure in response to
- a declaration in the component's descriptor. For more details on creating
-      aggregate analysis engines refer to <olink targetdoc="&uima_docs_ref;"
-      targetptr="ugr.ref.xml.component_descriptor"/>. The Component Descriptor Editor tool
- assists in the specification of aggregate AEs from a repository of available engines.
- For more details on this tool refer to <olink targetdoc="&uima_docs_tools;"
- targetptr="ugr.tools.cde"/>.</para>
-
-    <para>The UIMA framework implementation has two built-in flow implementations: one
-      that supports a linear flow between components, and one with conditional branching
- based on the language of the document. It also supports user-provided flow
- controllers, as described in <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.fc"/>. Furthermore, the application developer is
- free to create multiple AEs and provide their own logic to combine the AEs in arbitrarily
- complex flows. For more details on this the reader may refer to <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application.using_aes"/>.</para>
-
- </section>
-
- <section id="ugr.ovv.conceptual.applicaiton_building_and_collection_processing">
- <title>Application Building and Collection Processing</title>
-
- <note><title>&key_concepts;</title><para>Process Method, Collection Processing Architecture,
- Collection Reader, CAS Consumer, CAS Initializer, Collection Processing Engine,
- Collection Processing Manager.</para></note>
-
- <section id="ugr.ovv.conceptual.using_framework_from_an_application">
- <title>Using the framework from an Application</title>
-
- <figure id="ugr.ovv.conceptual.application_factory_ae">
- <title>Using UIMA Framework to create and interact with an Analysis Engine</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="618px" align="center" format="PNG" fileref="&imgroot;image008.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image008.png"/>
- </imageobject>
- <textobject><phrase>Picture of application interacting with UIMA's
- factory to produce an analysis engine, which acts as a container for annotators,
- and interfaces with the application via the process and getMetaData methods
- among others.</phrase>
- </textobject>
- </mediaobject>
- </figure>
-
- <para>As mentioned above, the basic AE interface may be thought of as simply CAS in/CAS
- out.</para>
-
- <para>The application is responsible for interacting with the UIMA framework to
- instantiate an AE, create or acquire an input CAS, initialize the input CAS with a
- document and then pass it to the AE through the <emphasis role="bold">process
- method</emphasis>. This interaction with the framework is illustrated in <xref
- linkend="ugr.ovv.conceptual.application_factory_ae"/>. </para>
-
- <para>The UIMA AE Factory takes the declarative information from the Component
- Descriptor and the class files implementing the annotator, and instantiates the AE
- instance, setting up the CAS and the UIMA Context.</para>
-
- <para>The AE, possibly calling many delegate AEs internally, performs the overall
- analysis and its process method returns the CAS containing new analysis results.
- </para>
-
- <para>The application then decides what to do with the returned CAS. There are many
- possibilities. For instance the application could: display the results, store the
- CAS to disk for post processing, extract and index analysis results as part of a search
- or database application etc.</para>
-
- <para>The UIMA framework provides methods to support the application developer in
- creating and managing CASes and instantiating, running and managing AEs. Details
- may be found in <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application"/>.</para>
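The CAS-in/CAS-out contract can be illustrated with a self-contained toy model in plain Java. These classes are simplified stand-ins, invented for illustration; they are not the actual UIMA AnalysisEngine and CAS APIs.

```java
import java.util.*;

// Toy model of the "CAS in / CAS out" process method contract.
public class ProcessDemo {

    static class Annotation {
        final String type; final int begin, end;
        Annotation(String type, int begin, int end) {
            this.type = type; this.begin = begin; this.end = end;
        }
    }

    // Stand-in for a CAS: the artifact plus the analysis results about it.
    static class ToyCas {
        String documentText;
        final List<Annotation> annotations = new ArrayList<>();
    }

    interface ToyAnalysisEngine {
        void process(ToyCas cas);   // reads the CAS, adds analysis results
    }

    // A trivial "annotator": marks each whitespace-separated token.
    static final ToyAnalysisEngine TOKENIZER = cas -> {
        int pos = 0;
        for (String tok : cas.documentText.split("\\s+")) {
            int begin = cas.documentText.indexOf(tok, pos);
            cas.annotations.add(new Annotation("Token", begin, begin + tok.length()));
            pos = begin + tok.length();
        }
    };

    public static void main(String[] args) {
        ToyCas cas = new ToyCas();                 // the application acquires a CAS,
        cas.documentText = "Fred Centers spoke";   // initializes it with a document,
        TOKENIZER.process(cas);                    // and passes it to the process method;
        System.out.println(cas.annotations.size() + " tokens");  // then uses the results
    }
}
```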
- </section>
-
- <section id="ugr.ovv.conceptual.graduating_to_collection_processing">
- <title>Graduating to Collection Processing</title>
- <figure id="ugr.ovv.conceptual.fig.cpe">
- <title>High-Level UIMA Component Architecture from Source to Sink</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="578px" format="PNG" align="center" fileref="&imgroot;image010.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image010.png"/>
- </imageobject>
- </mediaobject>
- </figure>
-
- <para>Many UIM applications analyze entire collections of documents. They connect to
- different document sources and do different things with the results. But in the
- typical case, the application must generally follow these logical steps:
-
- <orderedlist spacing="compact">
- <listitem><para>Connect to a physical source</para></listitem>
- <listitem><para>Acquire a document from the source</para></listitem>
- <listitem><para>Initialize a CAS with the document to be analyzed</para>
- </listitem>
- <listitem><para>Send the CAS to a selected analysis engine</para></listitem>
- <listitem><para>Process the resulting CAS</para></listitem>
- <listitem><para>Go back to 2 until the collection is processed</para>
- </listitem>
- <listitem><para>Do any final processing required after all the documents in the
- collection have been analyzed</para></listitem>
- </orderedlist> </para>
-
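The loop above can be sketched in plain Java. Everything here is a toy stand-in, invented for illustration, for the real Collection Reader, analysis engine and CAS Consumer components.

```java
import java.util.*;

// Toy sketch of the source-to-sink loop (steps 1 through 7 above).
public class CollectionLoop {

    public static List<String> processCollection(Iterator<String> reader) {
        List<String> consumed = new ArrayList<>();          // stands in for a CAS Consumer
        while (reader.hasNext()) {                          // steps 2 and 6: iterate the source
            String document = reader.next();                // step 2: acquire a document
            String cas = document;                          // step 3: initialize a "CAS"
            String analyzed = cas.toUpperCase(Locale.ROOT); // step 4: run the "analysis engine"
            consumed.add(analyzed);                         // step 5: process the resulting CAS
        }
        return consumed;                                    // step 7: final processing
    }

    public static void main(String[] args) {
        // step 1: "connect" to a source; here just an in-memory list
        List<String> source = List.of("doc one", "doc two");
        System.out.println(processCollection(source.iterator()));
    }
}
```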
- <para>UIMA supports UIM application development for this general type of processing
- through its <emphasis role="bold">Collection Processing
- Architecture</emphasis>.</para>
-
- <para>As part of the collection processing architecture UIMA introduces two primary
- components in addition to the annotator and analysis engine. These are the <emphasis
- role="bold">Collection Reader</emphasis> and the <emphasis role="bold">CAS
- Consumer</emphasis>. The complete flow from source, through document analysis,
- and to CAS Consumers supported by UIMA is illustrated in <xref
- linkend="ugr.ovv.conceptual.fig.cpe"/>.</para>
-
- <para>The Collection Reader's job is to connect to and iterate through a source
- collection, acquiring documents and initializing CASes for analysis. </para>
-
- <!--
- <para>Since the structure, access and iteration methods for
- physical document sources vary independently from the format of stored
- documents, UIMA defines another type of component called a <emphasis role="bold">CAS Intializer</emphasis>.
- The CAS Initializer's job is specific to a
- document format and specialized logic for mapping that format to a CAS. In the
- simplest case a CAS Intializer may take the document provided by the containing
- Collection Reader and insert it as a subject of analysis (or Sofa) in the
- CAS. A more advanced scenario is one
- where the CAS Intializer may be implemented to handle documents that conform to
- a certain XML schema and map some subset of the XML tags to CAS types and then
- insert the de-tagged document content as the subject of analysis. Collection Readers may reuse plug-in CAS
- Initializers for different document formats.</para>
- -->
-
- <para>CAS Consumers, as the name suggests, function at the end of the flow. Their job is
- to do the final CAS processing. A CAS Consumer may be implemented, for example, to
- index CAS contents in a search engine, extract elements of interest and populate a
- relational database or serialize and store analysis results to disk for subsequent
- and further analysis. </para>
-
- <para>A Semantic Search engine that works with UIMA is available from <ulink
- url="http://www.alphaworks.ibm.com/tech/uima">IBM's alphaWorks
-      site</ulink>, which allows the developer to experiment with indexing analysis
- results and querying for documents based on all the annotations in the CAS. See the
- section on integrating text analysis and search in <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application"/>.</para>
-
- <para>A UIMA <emphasis role="bold">Collection Processing Engine</emphasis> (CPE)
- is an aggregate component that specifies a <quote>source to sink</quote> flow from a
- Collection Reader though a set of analysis engines and then to a set of CAS Consumers.
- </para>
-
- <para>CPEs are specified by XML files called CPE Descriptors. These are declarative
- specifications that point to their contained components (Collection Readers,
- analysis engines and CAS Consumers) and indicate a flow among them. The flow
- specification allows for filtering capabilities to, for example, skip over AEs
- based on CAS contents. Details about the format of CPE Descriptors may be found in
- <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.
- </para>
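As an illustration only, a CPE descriptor has roughly the following shape. Many required elements are omitted and the referenced descriptor files are hypothetical; see the CPE Descriptor Reference for the complete, authoritative schema.

```xml
<!-- Hypothetical, abridged skeleton of a CPE descriptor -->
<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
  <collectionReader>
    <collectionIterator>
      <descriptor><import location="FileSystemCollectionReader.xml"/></descriptor>
    </collectionIterator>
  </collectionReader>
  <casProcessors>
    <!-- flows from the analysis engine to the CAS Consumer -->
    <casProcessor deployment="integrated" name="NamedEntityDetector">
      <descriptor><import location="MyNamedEntityDetector.xml"/></descriptor>
    </casProcessor>
    <casProcessor deployment="integrated" name="SearchIndexConsumer">
      <descriptor><import location="SearchIndexConsumer.xml"/></descriptor>
    </casProcessor>
  </casProcessors>
</cpeDescription>
```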
-
- <figure id="ugr.ovv.conceptual.fig.cpm">
- <title>Collection Processing Manager in UIMA Framework</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="576px" align="center" format="PNG" fileref="&imgroot;image012.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" align="center" format="PNG" fileref="&imgroot;image012.png"/>
- </imageobject>
- <textobject><phrase>box and arrows picture of application using CPE factory to
- instantiate a Collection Processing Engine, and that engine interacting with
- the application.</phrase></textobject>
- </mediaobject>
- </figure>
-
- <para>The UIMA framework includes a <emphasis role="bold">Collection Processing
- Manager</emphasis> (CPM). The CPM is capable of reading a CPE descriptor, and
-      deploying and running the specified CPE. <xref
-        linkend="ugr.ovv.conceptual.fig.cpm"/> illustrates the role of the CPM
-      in the UIMA Framework.</para>
-
- <para>Key features of the CPM are failure recovery, CAS management and scale-out.
- </para>
-
- <para>Collections may be large and take considerable time to analyze. A configurable
- behavior of the CPM is to log faults on single document failures while continuing to
- process the collection. This behavior is commonly used because analysis components
- often tend to be the weakest link -- in practice they may choke on strangely formatted
- content. </para>
-
- <para>This deployment option requires that the CPM run in a separate process or a
- machine distinct from the CPE components. A CPE may be configured to run with a variety
- of deployment options that control the features provided by the CPM. For details see
-      <olink targetdoc="&uima_docs_ref;"
-      targetptr="ugr.ref.xml.cpe_descriptor"/>.</para>
-
- <para>The UIMA SDK also provides a tool called the CPE Configurator. This tool provides
- the developer with a user interface that simplifies the process of connecting up all
- the components in a CPE and running the result. For details on using the CPE
- Configurator see <olink targetdoc="&uima_docs_tools;"
- targetptr="ugr.tools.cpe"/>. This tool currently does not provide
- access to the full set of CPE deployment options supported by the CPM; however, you can
- configure other parts of the CPE descriptor by editing it directly. For details on how
- to create and run CPEs refer to <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.cpe"/>.</para>
-
- </section>
-
- </section>
-
- <section id="ugr.ovv.conceptual.exploiting_analysis_results">
- <title>Exploiting Analysis Results</title>
-
- <note><title>&key_concepts;</title><para>Semantic Search, XML Fragment Queries.</para>
- </note>
-
- <section id="ugr.ovv.conceptual.semantic_search">
- <title>Semantic Search</title>
-
- <para>In a simple UIMA Collection Processing Engine (CPE), a Collection Reader reads
- documents from the file system and initializes CASs with their content. These are
-        then fed to an AE that annotates tokens and sentences; the CASs, now enriched with
- token and sentence information, are passed to a CAS Consumer that populates a search
- engine index. </para>
-
- <para>The search engine query processor can then use the token index to provide basic
- key-word search. For example, given a query <quote>center</quote> the search
- engine would return all the documents that contained the word
- <quote>center</quote>.</para>
-
- <para><emphasis role="bold">Semantic Search</emphasis> is a search paradigm that
- can exploit the additional metadata generated by analytics like a UIMA CPE.</para>
-
- <para>Consider that we plugged a named-entity recognizer into the CPE described
-        above. Assume this analysis engine is capable of detecting mentions of persons and
-        organizations in documents and annotating them in the CAS.</para>
-
-      <para>Complementing the named-entity recognizer, we add a CAS Consumer that extracts,
-        in addition to token and sentence annotations, the person and organization
-        annotations added to the CASs by the named-entity detector. It then feeds these into
-        the semantic search engine's index.</para>
-
-      <para>The semantic search engine that comes with the UIMA SDK, for example, can exploit
-        this additional information from the CAS to support more powerful queries. For
-        example, imagine a user is looking for documents that mention an organization with
-        <quote>center</quote> in its name but is not sure of the full or precise name of the
-        organization. A key-word search on <quote>center</quote> would likely produce far
-        too many documents because <quote>center</quote> is a common and ambiguous term.
- The semantic search engine that is available from <ulink
- url="http://www.alphaworks.ibm.com/tech/uima"/> supports a query language
- called <emphasis role="bold">XML Fragments</emphasis>. This query language is
- designed to exploit the CAS annotations entered in its index. The XML Fragment query,
- for example,
-
-
- <programlisting><organization> center </organization></programlisting>
-        will produce only documents that contain <quote>center</quote> where it
-        appears as part of a mention annotated as an organization by the named-entity
- recognizer. This will likely be a much shorter list of documents more precisely
- matching the user's interest.</para>
-
- <para>Consider taking this one step further. We add a relationship recognizer that
- annotates mentions of the CEO-of relationship. We configure the CAS Consumer so that
- it sends these new relationship annotations to the semantic search index as well.
- With these additional analysis results in the index we can submit queries like
-
-
- <programlisting><ceo_of>
- <person> center </person>
- <organization> center </organization>
-</ceo_of></programlisting>
- This query will precisely target documents that contain a mention of an organization
- with <quote>center</quote> as part of its name where that organization is mentioned
- as part of a
- <code>CEO-of</code> relationship annotated by the relationship
- recognizer.</para>
-
- <para>For more details about using UIMA and Semantic Search see the section on
- integrating text analysis and search in <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application"/>.</para>
- </section>
-
- <section id="ugr.ovv.conceptual.databases">
- <title>Databases</title>
-
- <para>Search engine indices are not the only place to deposit analysis results for use
- by applications. Another classic example is populating databases. While many
-        approaches are possible with varying degrees of flexibility and performance, all are
-        highly dependent on application specifics. We included a simple sample CAS Consumer
-        that provides the basics for getting your analysis results into a relational
- database. It extracts annotations from a CAS and writes them to a relational
- database, using the open source Apache Derby database.</para>
- </section>
- </section>
-
- <section id="ugr.ovv.conceptual.multimodal_processing">
- <title>Multimodal Processing in UIMA</title>
- <para>In previous sections we've seen how the CAS is initialized with an initial
- artifact that will be subsequently analyzed by Analysis engines and CAS Consumers. The
- first Analysis engine may make some assertions about the artifact, for example, in the
- form of annotations. Subsequent Analysis engines will make further assertions about
- both the artifact and previous analysis results, and finally one or more CAS Consumers
- will extract information from these CASs for structured information storage.</para>
- <figure id="ugr.ovv.conceptual.fig.multiple_sofas">
- <title>Multiple Sofas in support of multi-modal analysis of an audio Stream. Some
- engines work on the audio <quote>view</quote>, some on the text
- <quote>view</quote> and some on both.</title>
- <mediaobject>
- <imageobject role="html">
- <imagedata width="576px" format="PNG" align="center" fileref="&imgroot;image014.png"/>
- </imageobject>
- <imageobject role="fo">
- <imagedata width="5.5in" format="PNG" align="center" fileref="&imgroot;image014.png"/>
- </imageobject>
- <textobject><phrase>Picture showing audio on the left broken into segments by a
- segmentation component, then sent to multiple analysis pipelines in parallel,
- some processing the raw audio, others processing the recognized speech as
- text.</phrase></textobject>
- </mediaobject>
- </figure>
- <para>Consider a processing pipeline, illustrated in <xref
- linkend="ugr.ovv.conceptual.fig.multiple_sofas"/>, that starts with an
- audio recording of a conversation, transcribes the audio into text, and then extracts
- information from the text transcript. Analysis Engines at the start of the pipeline are
- analyzing an audio subject of analysis, and later analysis engines are analyzing a text
- subject of analysis. The CAS Consumer will likely want to build a search index from
- concepts found in the text to the original audio segment covered by the concept.</para>
-
- <para>What becomes clear from this relatively simple scenario is that the CAS must be
-      capable of simultaneously holding multiple subjects of analysis. Some analysis
-      engines will analyze only one subject of analysis, some will analyze one and create
- another, and some will need to access multiple subjects of analysis at the same time.
- </para>
-
- <para>The support in UIMA for multiple subjects of analysis is called <emphasis
- role="bold">Sofa</emphasis> support; Sofa is an acronym which is derived from
- <emphasis role="underline">S</emphasis>ubject <emphasis role="underline">
- of</emphasis> <emphasis role="underline">A</emphasis>nalysis, which is a physical
- representation of an artifact (e.g., the detagged text of a web-page, the HTML
- text of the same web-page, the audio segment of a video, the close-caption text
- of the same audio segment). A Sofa may
- be associated with CAS Views. A particular CAS will have one or more views, each view
- corresponding to a particular subject of analysis, together with a set of the defined
- indexes that index the metadata created in that view.</para>
-
- <para>Analysis results can be indexed in, or <quote>belong</quote> to, a specific view.
- UIMA components may be written in <quote>Multi-View</quote> mode - able to create and
- access multiple Sofas at the same time, or in <quote>Single-View</quote> mode, simply
- receiving a particular view of the CAS corresponding to a particular single Sofa. For
- single-view mode components, it is up to the person assembling the component to supply
-      the needed information to ensure a particular view is passed to the component at run
- time. This is done using XML descriptors for Sofa mapping (see <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.mvs.sofa_name_mapping"/>).</para>
-
- <para>Multi-View capability brings benefits to text-only processing as well. An input
- document can be transformed from one format to another. Examples of this include
- transforming text from HTML to plain text or from one natural language to another.
- </para>
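A toy sketch in plain Java of the multiple-view idea follows. The view names and the trivial detagger are invented for illustration and do not use the real CAS view API.

```java
import java.util.*;

// Toy illustration of multiple views (Sofas) over one artifact:
// an HTML view plus a derived plain-text view in the same "CAS".
public class ViewDemo {

    public static Map<String, String> buildViews(String html) {
        Map<String, String> views = new LinkedHashMap<>();
        views.put("htmlView", html);
        // A "detagger" component reads one view and creates another;
        // downstream single-view components can then be mapped to "textView".
        views.put("textView", html.replaceAll("<[^>]*>", ""));
        return views;
    }

    public static void main(String[] args) {
        Map<String, String> cas = buildViews("<p>Fred <b>Centers</b></p>");
        System.out.println(cas.get("textView"));  // prints "Fred Centers"
    }
}
```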
- </section>
-
- <section id="ugr.ovv.conceptual.next_steps">
- <title>Next Steps</title>
-
- <para>This chapter presented a high-level overview of UIMA concepts. Along the way, it
- pointed to other documents in the UIMA SDK documentation set where the reader can find
- details on how to apply the related concepts in building applications with the UIMA
- SDK.</para>
-
- <para>At this point the reader may return to the documentation guide in <olink
- targetdoc="&uima_docs_overview;" targetptr="ugr.project_overview_doc_use"/>
- to learn how they might proceed in getting started using UIMA.</para>
-
- <para>For a more detailed overview of the UIMA architecture, framework and development
- roles we refer the reader to the following paper:</para>
-
- <para>D. Ferrucci and A. Lally, <quote>Building an example application using the
- Unstructured Information Management Architecture,</quote> <emphasis>IBM Systems
- Journal</emphasis> <emphasis role="bold">43</emphasis>, No. 3, 455-475 (2004).
- </para>
-
- <para>This paper can be found on line at <ulink
- url="http://www.research.ibm.com/journal/sj43-3.html"/></para>
- </section>
-
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
+"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd" [
+<!ENTITY key_concepts "Key UIMA Concepts Introduced in this Section:">
+<!ENTITY imgroot "../images/overview_and_setup/conceptual_overview_files/" >
+<!ENTITY % uimaents SYSTEM "../entities.ent" >
+%uimaents;
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<chapter id="ugr.ovv.conceptual">
+ <title>UIMA Conceptual Overview</title>
+
+  <para>UIMA is an open, industrial-strength, scalable and extensible platform for
+ creating, integrating and deploying unstructured information management solutions
+ from powerful text or multi-modal analysis and search components. </para>
+
+ <para>The Apache UIMA project is an implementation of the Java UIMA framework available
+ under the Apache License, providing a common foundation for industry and academia to
+ collaborate and accelerate the world-wide development of technologies critical for
+ discovering vital knowledge present in the fastest growing sources of information
+ today.</para>
+
+ <para>This chapter presents an introduction to many essential UIMA concepts. It is meant to
+ provide a broad overview to give the reader a quick sense of UIMA's basic
+ architectural philosophy and the UIMA SDK's capabilities. </para>
+
+ <para>This chapter provides a general orientation to UIMA and makes liberal reference to
+ the other chapters in the UIMA SDK documentation set, where the reader may find detailed
+ treatments of key concepts and development practices. It may be useful to refer to <olink
+ targetdoc="&uima_docs_overview;" targetptr="ugr.glossary"/>, to become familiar
+ with the terminology in this overview.</para>
+
+ <section id="ugr.ovv.conceptual.uima_introduction">
+ <title>UIMA Introduction</title>
+ <figure id="ugr.ovv.conceptual.fig.bridge">
+ <title>UIMA helps you build the bridge between the unstructured and structured
+ worlds</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.5in" format="PNG" fileref="&imgroot;image002.png"/>
+ </imageobject>
+ <textobject><phrase>Picture of a bridge between unstructured information
+ artifacts and structured metadata about those artifacts</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+ <para> Unstructured information represents the largest, most current and fastest
+ growing source of information available to businesses and governments. The web is just
+ the tip of the iceberg. Consider the mounds of information hosted in the enterprise and
+ around the world and across different media including text, voice and video. The
+ high-value content in these vast collections of unstructured information is,
+ unfortunately, buried in lots of noise. Searching for what you need or doing
+ sophisticated data mining over unstructured information sources presents new
+ challenges. </para>
+
+ <para>An unstructured information management (UIM) application may be generally
+ characterized as a software system that analyzes large volumes of unstructured
+ information (text, audio, video, images, etc.) to discover, organize and deliver
+ relevant knowledge to the client or application end-user. An example is an application
+ that processes millions of medical abstracts to discover critical drug interactions.
+ Another example is an application that processes tens of millions of documents to
+ discover key evidence indicating probable competitive threats. </para>
+
+ <para>First and foremost, the unstructured data must be analyzed to interpret, detect
+ and locate concepts of interest, for example, named entities like persons,
+ organizations, locations, facilities, products etc., that are not explicitly tagged
+ or annotated in the original artifact. More challenging analytics may detect things
+ like opinions, complaints, threats or facts. And then there are relations, for
+ example, located in, finances, supports, purchases, repairs etc. The list of concepts
+ important for applications to discover in unstructured content is large, varied and
+ often domain specific.
+ Many different component analytics may solve different parts of the overall analysis task.
+ These component analytics must interoperate and must be easily combined to facilitate
+      the development of UIM applications.</para>
+
+    <para>The results of analysis are used to populate structured forms so that conventional
+ data processing and search technologies
+ like search engines, database engines or OLAP
+ (On-Line Analytical Processing, or Data Mining) engines
+      can efficiently deliver the newly discovered content in response to client
+      requests or queries.</para>
+
+ <para>In analyzing unstructured content, UIM applications make use of a variety of
+ analysis technologies including:</para>
+
+ <itemizedlist spacing="compact">
+ <listitem><para>Statistical and rule-based Natural Language Processing
+ (NLP)</para>
+ </listitem>
+ <listitem><para>Information Retrieval (IR)</para>
+ </listitem>
+ <listitem><para>Machine learning</para>
+ </listitem>
+ <listitem><para>Ontologies</para>
+ </listitem>
+ <listitem><para>Automated reasoning and</para>
+ </listitem>
+ <listitem><para>Knowledge Sources (e.g., CYC, WordNet, FrameNet, etc.)</para>
+ </listitem>
+
+ </itemizedlist>
+
+ <para>Specific analysis capabilities using these technologies are developed
+ independently using different techniques, interfaces and platforms.
+ </para>
+
+ <para>The bridge from the unstructured world to the structured world is built through the
+ composition and deployment of these analysis capabilities. This integration is often
+ a costly challenge. </para>
+
+ <para>The Unstructured Information Management Architecture (UIMA) is an architecture
+ and software framework that helps you build that bridge. It supports creating,
+ discovering, composing and deploying a broad range of analysis capabilities and
+ linking them to structured information services.</para>
+
+ <para>UIMA allows development teams to match the right skills with the right parts of a
+ solution and helps enable rapid integration across technologies and platforms using a
+      variety of different deployment options. These range from tightly-coupled
+      deployments for high-performance, single-machine, embedded solutions to parallel
+      and fully distributed deployments for highly flexible and scalable
+      solutions.</para>
+
+ </section>
+
+ <section id="ugr.ovv.conceptual.architecture_framework_sdk">
+ <title>The Architecture, the Framework and the SDK</title>
+ <para>UIMA is a software architecture which specifies component interfaces, data
+ representations, design patterns and development roles for creating, describing,
+ discovering, composing and deploying multi-modal analysis capabilities.</para>
+
+ <para>The <emphasis role="bold">UIMA framework</emphasis> provides a run-time
+ environment in which developers can plug in their UIMA component implementations and
+ with which they can build and deploy UIM applications. The framework is not specific to
+ any IDE or platform. Apache hosts a Java and (soon) a C++ implementation of the UIMA
+ Framework.</para>
+
+ <para>The <emphasis role="bold">UIMA Software Development Kit (SDK)</emphasis>
+ includes the UIMA framework, plus tools and utilities for using UIMA. Some of the
+ tooling supports an Eclipse-based ( <ulink url="http://www.eclipse.org/"/>)
+ development environment. </para>
+
+ </section>
+
+ <section id="ugr.ovv.conceptual.analysis_basics">
+ <title>Analysis Basics</title>
+ <note><title>&key_concepts;</title><para>Analysis Engine, Document, Annotator, Annotator
+ Developer, Type, Type System, Feature, Annotation, CAS, Sofa, JCas, UIMA
+ Context.</para>
+ </note>
+
+ <section id="ugr.ovv.conceptual.aes_annotators_and_analysis_results">
+ <title>Analysis Engines, Annotators & Results</title>
+ <figure id="ugr.ovv.conceptual.metadata_in_cas">
+ <title>Objects represented in the Common Analysis Structure (CAS)</title>
+ <mediaobject>
+ <imageobject role="html">
+ <imagedata format="PNG" width="594px" align="center" fileref="&imgroot;image004.png"/>
+ </imageobject>
+ <imageobject role="fo">
+ <imagedata format="PNG" width="5.5in" align="center" fileref="&imgroot;image004.png"/>
+ </imageobject>
+ <textobject><phrase>Picture of some text, with a hierarchy of discovered
+ metadata about words in the text, including some image of a person as metadata
+ about that name.</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+ <para>UIMA is an architecture in which basic building blocks called Analysis Engines
+ (AEs) are composed to analyze a document and infer and record descriptive attributes
+ about the document as a whole, and/or about regions therein. This descriptive
+ information, produced by AEs is referred to generally as <emphasis role="bold">
+ analysis results</emphasis>. Analysis results typically represent meta-data
+ about the document content. One way to think about AEs is as software agents that
+ automatically discover and record meta-data about original content.</para>
+
+ <para>UIMA supports the analysis of different modalities including text, audio and
+ video. The majority of examples we provide are for text. We use the term <emphasis
+          role="bold">document</emphasis>, therefore, to generally refer to any unit of
+ content that an AE may process, whether it is a text document or a segment of audio, for
+ example. See the section <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.mvs"/> for more information on multimodal processing
+ in UIMA.</para>
+
+ <para>Analysis results include different statements about the content of a document.
+ For example, the following is an assertion about the topic of a document:</para>
+
+
+ <programlisting>(1) The Topic of document D102 is "CEOs and Golf".</programlisting>
+
+ <para>Analysis results may include statements describing regions more granular than
+ the entire document. We use the term <emphasis role="bold">span</emphasis> to
+ refer to a sequence of characters in a text document. Consider that a document with the
+            identifier D102 contains a span, <quote>Fred Centers</quote>, starting at
+ character position 101. An AE that can detect persons in text may represent the
+ following statement as an analysis result:</para>
+
+
+ <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
+
+ <para>In both statements 1 and 2 above there is a special pre-defined term or what we call
+ in UIMA a <emphasis role="bold">Type</emphasis>. They are
+ <emphasis>Topic</emphasis> and <emphasis>Person</emphasis> respectively.
+ UIMA types characterize the kinds of results that an AE may create – more on
+ types later.</para>
+
+ <para>Other analysis results may relate two statements. For example, an AE might
+ record in its results that two spans are both referring to the same person:</para>
+
+
+ <programlisting>(3) The Person denoted by span 101 to 112 and
+ the Person denoted by span 141 to 143 in document D102
+ refer to the same Entity.</programlisting>
+
+ <para>The above statements are some examples of the kinds of results that AEs may record
+ to describe the content of the documents they analyze. These are not meant to indicate
+ the form or syntax with which these results are captured in UIMA – more on that
+ later in this overview.</para>
+
+          <para>The UIMA framework treats Analysis Engines as pluggable, composable,
+            discoverable, managed objects. At the heart of AEs are the analysis algorithms that
+            do all the work to analyze documents and record analysis results. </para>
+
+ <para>UIMA provides a basic component type intended to house the core analysis
+ algorithms running inside AEs. Instances of this component are called <emphasis
+ role="bold">Annotators</emphasis>. The analysis algorithm developer's
+ primary concern therefore is the development of annotators. The UIMA framework
+ provides the necessary methods for taking annotators and creating analysis
+ engines.</para>
+
+ <para>In UIMA the person who codes analysis algorithms takes on the role of the
+ <emphasis role="bold">Annotator Developer</emphasis>. <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.aae"/> will take the reader
+ through the details involved in creating UIMA annotators and analysis
+ engines.</para>
+
+          <para>At the most primitive level an AE wraps an annotator, adding the necessary APIs and
+ infrastructure for the composition and deployment of annotators within the UIMA
+ framework. The simplest AE contains exactly one annotator at its core. Complex AEs
+ may contain a collection of other AEs each potentially containing within them other
+ AEs. </para>
+ </section>
+
+ <section id="ugr.ovv.conceptual.representing_results_in_cas">
+ <title>Representing Analysis Results in the CAS</title>
+
+ <para>How annotators represent and share their results is an important part of the UIMA
+ architecture. UIMA defines a <emphasis role="bold">Common Analysis Structure
+ (CAS)</emphasis> precisely for these purposes.</para>
+
+ <para>The CAS is an object-based data structure that allows the representation of
+ objects, properties and values. Object types may be related to each other in a
+ single-inheritance hierarchy. The CAS logically (if not physically) contains the
+ document being analyzed. Analysis developers share and record their analysis
+ results in terms of an object model within the CAS. <footnote><para> We have plans to
+ extend the representational capabilities of the CAS and align its semantics with the
+ semantics of the OMG's Essential Meta-Object Facility (EMOF) and with the
+            semantics of the Eclipse Modeling Framework's (<ulink
+            url="http://www.eclipse.org/emf/"/>) Ecore semantics and XMI-based
+ representation.</para> </footnote> </para>
+
+ <para>The UIMA framework includes an implementation and interfaces to the CAS. For a
+ more detailed description of the CAS and its interfaces see <olink
+ targetdoc="&uima_docs_ref;" targetptr="ugr.ref.cas"/>.</para>
+
+ <para>A CAS that logically contains statement 2 (repeated here for your
+ convenience)</para>
+
+
+ <programlisting>(2) The span from position 101 to 112 in document D102 denotes a Person</programlisting>
+
+ <para>would include objects of the Person type. For each person found in the body of a
+ document, the AE would create a Person object in the CAS and link it to the span of text
+ where the person was mentioned in the document.</para>
+
+ <para>While the CAS is a general purpose data structure, UIMA defines a
+ few basic types and affords the developer the ability to extend these to define an
+ arbitrarily rich <emphasis role="bold">Type System</emphasis>. You can think of a
+ type system as an object schema for the CAS.</para>
+
+ <para>A type system defines the various types of objects that may be discovered in
+            documents by AEs that subscribe to that type system.</para>
+
+ <para>As suggested above, Person may be defined as a type. Types have properties or
+ <emphasis role="bold">features</emphasis>. So for example,
+ <emphasis>Age</emphasis> and <emphasis>Occupation</emphasis> may be defined as
+ features of the Person type.</para>
+
+ <para>Other types might be <emphasis>Organization, Company, Bank, Facility, Money,
+ Size, Price, Phone Number, Phone Call, Relation, Network Packet, Product, Noun
+ Phrase, Verb, Color, Parse Node, Feature Weight Array</emphasis> etc.</para>
+
+ <para>There are no limits to the different types that may be defined in a type system. A
+ type system is domain and application specific.</para>
+
+ <para>Types in a UIMA type system may be organized into a taxonomy. For example,
+ <emphasis>Company</emphasis> may be defined as a subtype of
+ <emphasis>Organization</emphasis>. <emphasis>NounPhrase</emphasis> may be a
+ subtype of a <emphasis>ParseNode</emphasis>.</para>
+
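+          <para>In a type system descriptor (an XML file discussed later in this chapter),
+            such a taxonomy is declared by naming each type's supertype. The following is a
+            minimal sketch only; the <code>example.*</code> names are invented here:</para>
+
+          <programlisting>&lt;typeDescription&gt;
+  &lt;name&gt;example.Company&lt;/name&gt;
+  &lt;description&gt;A company, a kind of organization.&lt;/description&gt;
+  &lt;supertypeName&gt;example.Organization&lt;/supertypeName&gt;
+&lt;/typeDescription&gt;</programlisting>
+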
+ <section id="ugr.ovv.conceptual.annotation_type">
+ <title>The Annotation Type</title>
+
+            <para>A general and common type used in artifact analysis, and one from which
+              additional types are often derived, is the <emphasis role="bold">annotation</emphasis>
+              type. </para>
+
+ <para>The annotation type is used to annotate or label regions of an artifact. Common
+ artifacts are text documents, but they can be other things, such as audio streams.
+ The annotation type for text includes two features, namely
+ <emphasis>begin</emphasis> and <emphasis>end</emphasis>. Values of these
+ features represent integer offsets in the artifact and delimit a span. Any
+ particular annotation object identifies the span it annotates with the
+ <emphasis>begin</emphasis> and <emphasis>end</emphasis> features.</para>
+
+ <para>The key idea here is that the annotation type is used to identify and label or
+ <quote>annotate</quote> a specific region of an artifact.</para>
+
+ <para>Consider that the Person type is defined as a subtype of annotation. An
+ annotator, for example, can create a Person annotation to record the discovery of a
+ mention of a person between position 141 and 143 in document D102. The annotator can
+ create another person annotation to record the detection of a mention of a person in
+ the span between positions 101 and 112. </para>
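+
+            <para>Using the JCas interface described later in this chapter, recording these
+              two discoveries might look like the following sketch in Java, assuming a
+              <code>Person</code> cover class has been generated from the type
+              system:</para>
+
+            <programlisting>// a sketch; Person is a generated JCas cover class
+Person p1 = new Person(jcas, 101, 112);
+p1.addToIndexes();
+Person p2 = new Person(jcas, 141, 143);
+p2.addToIndexes();</programlisting>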
+ </section>
+ <section id="ugr.ovv.conceptual.not_just_annotations">
+ <title>Not Just Annotations</title>
+
+ <para>While the annotation type is a useful type for annotating regions of a
+ document, annotations are not the only kind of types in a CAS. A CAS is a general
+ representation scheme and may store arbitrary data structures to represent the
+ analysis of documents.</para>
+
+ <para>As an example, consider statement 3 above (repeated here for your
+ convenience).</para>
+
+
+ <programlisting>(3) The Person denoted by span 101 to 112 and
+ the Person denoted by span 141 to 143 in document D102
+ refer to the same Entity.</programlisting>
+
+            <para>This statement mentions two person annotations in the CAS: the first, call it
+              P1, delimiting the span from 101 to 112, and the other, call it P2, delimiting the span
+              from 141 to 143. Statement 3 asserts explicitly that these two spans refer to the
+ same entity. This means that while there are two expressions in the text
+ represented by the annotations P1 and P2, each refers to one and the same person.
+ </para>
+
+ <para>The Entity type may be introduced into a type system to capture this kind of
+ information. The Entity type is not an annotation. It is intended to represent an
+ object in the domain which may be referred to by different expressions (or
+ mentions) occurring multiple times within a document (or across documents within
+ a collection of documents). The Entity type has a feature named
+              <emphasis>occurrences</emphasis>. This feature is used to point to all the
+ annotations believed to label mentions of the same entity.</para>
+
+ <para>Consider that the spans annotated by P1 and P2 were <quote>Fred
+ Center</quote> and <quote>He</quote> respectively. The annotator might create
+ a new Entity object called
+ <code>FredCenter</code>. To represent the relationship in statement 3 above,
+ the annotator may link FredCenter to both P1 and P2 by making them values of its
+ <emphasis>occurrences</emphasis> feature.</para>
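+
+            <para>In Java, linking the entity to its mentions might be sketched as follows,
+              assuming generated <code>Entity</code> and <code>Person</code> cover classes
+              and an <emphasis>occurrences</emphasis> feature declared with range
+              <code>uima.cas.FSArray</code>:</para>
+
+            <programlisting>// a sketch; class and feature names follow the example above
+Entity fredCenter = new Entity(jcas);
+FSArray mentions = new FSArray(jcas, 2);
+mentions.set(0, p1);
+mentions.set(1, p2);
+fredCenter.setOccurrences(mentions);
+fredCenter.addToIndexes();</programlisting>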
+
+ <para> <xref linkend="ugr.ovv.conceptual.metadata_in_cas"/> also
+ illustrates that an entity may be linked to annotations referring to regions of
+ image documents as well. To do this the annotation type would have to be extended
+ with the appropriate features to point to regions of an image.</para>
+ </section>
+
+ <section id="ugr.ovv.conceptual.multiple_views_within_a_cas">
+ <title>Multiple Views within a CAS</title>
+
+ <para>UIMA supports the simultaneous analysis of multiple views of a document. This
+ support comes in handy for processing multiple forms of the artifact, for example, the audio
+ and the closed captioned views of a single speech stream, or the tagged and detagged
+ views of an HTML document.</para>
+
+ <para>AEs analyze one or more views of a document. Each view contains a specific
+              <emphasis role="bold">subject of analysis (Sofa)</emphasis>, plus a set of
+ indexes holding metadata indexed by that view. The CAS, overall, holds one or more
+ CAS Views, plus the descriptive objects that represent the analysis results for
+ each. </para>
+
+ <para>Another common example of using CAS Views is for different translations of a
+ document. Each translation may be represented with a different CAS View. Each
+ translation may be described by a different set of analysis results. For more
+ details on CAS Views and Sofas see <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.mvs"/> and <olink
+ targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aas"/>. </para>
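+
+              <para>In Java, an annotator obtains a particular view by name. A brief sketch,
+                with an invented view name:</para>
+
+              <programlisting>// a sketch; the view name "transcript" is illustrative
+JCas transcriptView = jcas.getView("transcript");
+String transcriptText = transcriptView.getDocumentText();</programlisting>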
+ </section>
+ </section>
+
+ <section id="ugr.ovv.conceptual.interacting_with_cas_and_external_resources">
+ <title>Interacting with the CAS and External Resources</title>
+ <titleabbrev>Using CASes and External Resources</titleabbrev>
+
+ <para>The two main interfaces that a UIMA component developer interacts with are the
+ CAS and the UIMA Context.</para>
+
+ <para>UIMA provides an efficient implementation of the CAS with multiple programming
+ interfaces. Through these interfaces, the annotator developer interacts with the
+ document and reads and writes analysis results. The CAS interfaces provide a suite of
+ access methods that allow the developer to obtain indexed iterators to the different
+ objects in the CAS. See <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.cas"/>. While many objects may exist in a CAS, the annotator
+ developer can obtain a specialized iterator to all Person objects associated with a
+ particular view, for example.</para>
+
+ <para>For Java annotator developers, UIMA provides the JCas. This interface provides
+ the Java developer with a natural interface to CAS objects. Each type declared in the
+            type system appears as a Java class; the UIMA framework renders the Person type as a
+ Person class in Java. As the analysis algorithm detects mentions of persons in the
+ documents, it can create Person objects in the CAS. For more details on how to interact
+ with the CAS using this interface, refer to <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.jcas"/>.</para>
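+
+          <para>For example, iterating over all Person annotations with the JCas might be
+            sketched as follows, again assuming a generated <code>Person</code> cover
+            class:</para>
+
+          <programlisting>// a sketch; Person is a generated JCas cover class
+FSIterator it = jcas.getAnnotationIndex(Person.type).iterator();
+while (it.hasNext()) {
+  Person p = (Person) it.next();
+  // the text span this annotation labels
+  String span = p.getCoveredText();
+}</programlisting>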
+
+ <para>The component developer, in addition to interacting with the CAS, can access
+ external resources through the framework's resource manager interface
+            called the <emphasis role="bold">UIMA Context</emphasis>. This interface can,
+            among other things, ensure that different annotators working together in an aggregate
+            flow share the same instance of an external file, for example. For details on using
+ the UIMA Context see <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.aae"/>.</para>
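+
+          <para>For instance, an annotator can read a configuration parameter through its
+            UIMA Context. A brief sketch, with an invented parameter name:</para>
+
+          <programlisting>// a sketch, inside an annotator; "Language" is an
+// invented configuration parameter name
+String language = (String) getContext().getConfigParameterValue("Language");</programlisting>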
+
+ </section>
+ <section id="ugr.ovv.conceptual.component_descriptors">
+ <title>Component Descriptors</title>
+          <para>UIMA defines interfaces for a small set of core components that users of the
+            framework provide implementations for. Annotators and Analysis Engines are two of
+ the basic building blocks specified by the architecture. Developers implement them
+ to build and compose analysis capabilities and ultimately applications.</para>
+
+          <para>There are other components in addition to these, which we will learn about
+            later, but every component specified in UIMA requires two parts for its
+            implementation:</para>
+
+ <orderedlist spacing="compact">
+ <listitem><para>the declarative part and</para></listitem>
+ <listitem><para>the code part.</para></listitem>
+ </orderedlist>
+
+ <para>The declarative part contains metadata describing the component, its
+ identity, structure and behavior and is called the <emphasis role="bold">
+ Component Descriptor</emphasis>. Component descriptors are represented in XML.
+ The code part implements the algorithm. The code part may be a program in Java.</para>
+
+          <para>As a developer using the UIMA SDK, implementing a UIMA component always
+            involves providing these two parts: the code part and the Component Descriptor.
+            Note that when you are composing an aggregate engine, the code may already be
+            provided in reusable subcomponents. In these cases you are not developing new
+            code but rather composing an aggregate engine by pointing to other components
+            where the code has been included.</para>
+
+          <para>Component descriptors aid in component discovery, reuse, composition and
+            development tooling. The UIMA SDK provides a tool for easily creating and
+            maintaining component descriptors that relieves the developer from editing XML
+            directly. This tool is described briefly in <olink
+            targetdoc="&uima_docs_tutorial_guides;"
+            targetptr="ugr.tug.aae"/>, and more
+            thoroughly in <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde"/>.</para>
+
+ <para>Component descriptors contain standard metadata including the
+ component's name, author, version, and a reference to the class that
+ implements the component.</para>
+
+ <para>In addition to these standard fields, a component descriptor identifies the
+ type system the component uses and the types it requires in an input CAS and the types it
+ plans to produce in an output CAS.</para>
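+
+          <para>In the descriptor, these inputs and outputs are declared in a capabilities
+            element, along the lines of the following sketch (the <code>example.*</code>
+            type names are invented):</para>
+
+          <programlisting>&lt;capabilities&gt;
+  &lt;capability&gt;
+    &lt;inputs&gt;
+      &lt;type&gt;example.Token&lt;/type&gt;
+      &lt;type&gt;example.ParseNode&lt;/type&gt;
+    &lt;/inputs&gt;
+    &lt;outputs&gt;
+      &lt;type&gt;example.Person&lt;/type&gt;
+    &lt;/outputs&gt;
+  &lt;/capability&gt;
+&lt;/capabilities&gt;</programlisting>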
+
+ <para>For example, an AE that detects person types may require as input a CAS that
+ includes a tokenization and deep parse of the document. The descriptor refers to a
+ type system to make the component's input requirements and output types
+ explicit. In effect, the descriptor includes a declarative description of the
+ component's behavior and can be used to aid in component discovery and
+ composition based on desired results. UIMA analysis engines provide an interface
+ for accessing the component metadata represented in their descriptors. For more
[... 484 lines stripped ...]