Posted to commits@uima.apache.org by sc...@apache.org on 2010/05/06 16:01:57 UTC

svn commit: r941739 [5/5] - in /uima/uimaj/branches/mavenAlign/uima-docbook-references: ./ src/ src/docbook/ src/docbook/images/ src/docbook/images/references/ src/docbook/images/references/ref.cas/ src/docbook/images/references/ref.javadocs/ src/docbo...

Added: uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/ref.xml.cpe_descriptor.xml
URL: http://svn.apache.org/viewvc/uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/ref.xml.cpe_descriptor.xml?rev=941739&view=auto
==============================================================================
--- uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/ref.xml.cpe_descriptor.xml (added)
+++ uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/ref.xml.cpe_descriptor.xml Thu May  6 14:01:56 2010
@@ -0,0 +1,1368 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+<!ENTITY imgroot "images/references/ref.xml.cpe_descriptor/">
+<!ENTITY tp "ugr.ref.xml.cpe_descriptor.">
+<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >  
+%uimaents;
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<chapter id="ugr.ref.xml.cpe_descriptor">
+  <title>Collection Processing Engine Descriptor Reference</title>
+  <titleabbrev>CPE Descriptor Reference</titleabbrev>
+  
+  <para>A UIMA <emphasis>Collection Processing Engine</emphasis> (CPE) is a combination
+    of UIMA components assembled to analyze a collection of artifacts. A CPE is an
+    instantiation of the UIMA <emphasis>Collection Processing Architecture</emphasis>,
+    which defines the collection processing components, interfaces, and APIs. A CPE is
+    executed by a UIMA framework component called the <emphasis>Collection Processing
+    Manager</emphasis> (CPM), which provides a number of services for deploying CPEs,
+    running CPEs, and handling errors.</para>
+  
+  <para>A CPE can be assembled programmatically within a Java application, or it can be
+    assembled declaratively via a CPE configuration specification, called a CPE
+    Descriptor. This chapter describes the format of the CPE Descriptor.</para>
+  
+  <para>Details about the CPE, including its function, sub-components, APIs, and related
+    tools, can be found in <olink targetdoc="&uima_docs_tutorial_guides;"
+      targetptr="ugr.tug.cpe"/>. Here we briefly summarize the CPE to define terms and
+    provide context for the later sections that describe the CPE Descriptor.</para>
+  
+  <section id="&tp;overview">
+    <title>CPE Overview</title>
+    
+    <figure id="&tp;overview.fig.runtime">
+      <title>CPE Runtime Overview</title>
+      <mediaobject>
+        <imageobject>
+          <imagedata width="5.8in" format="PNG"
+            fileref="&imgroot;image002.png"/>
+        </imageobject>
+        <textobject><phrase>CPE Runtime Overview diagram</phrase></textobject>
+      </mediaobject>
+    </figure>
+    
+    <para>An illustration of the CPE runtime is shown in <xref
+        linkend="&tp;overview.fig.runtime"/>. Some of the CPE components, such as the
+      <emphasis>queues</emphasis> and <emphasis>processing pipelines</emphasis>, are
+      internal to the CPE, but their behavior and deployment may be configured using the CPE
+      Descriptor. Other CPE components, such as the <emphasis>Collection
+      Reader</emphasis> and <emphasis>CAS Processors</emphasis>, are defined and
+      configured externally from the CPE and then plugged in to the CPE to create the overall
+      engine. The parts of a CPE are:
+      
+      <variablelist>
+        <varlistentry>
+          <term>Collection Reader</term>
+          <listitem><para>understands the native data collection format and iterates
+            over the collection producing subjects of analysis</para></listitem>
+        </varlistentry>
+        
+        <varlistentry>
+          <term>CAS Initializer<footnote><para>Deprecated</para></footnote>
+            </term>
+          <listitem><para>initializes a CAS with a subject of analysis</para>
+            </listitem>
+        </varlistentry>
+        
+        <varlistentry>
+          <term>Artifact Producer</term>
+          <listitem><para>asynchronously pulls CASes from the Collection Reader,
+            creates batches of CASes and puts them into the work queue</para></listitem>
+        </varlistentry>
+        
+        <varlistentry>
+          <term>Work Queue</term>
+          <listitem><para>shared queue containing batches of CASes queued by the Artifact
+            Producer for analysis by Analysis Engines</para>
+          </listitem>
+        </varlistentry>
+        
+        <varlistentry>
+          <term>B1-Bn</term>
+          <listitem><para>individual batches containing 1 or more CASes</para>
+            </listitem>
+        </varlistentry>
+        
+        <varlistentry>
+          <term>AE1-AEn</term>
+          <listitem><para>Analysis Engines arranged by a CPE descriptor</para>
+            </listitem>
+        </varlistentry>
+        
+        <varlistentry>
+          <term>Processing Pipelines</term>
+          <listitem><para>each pipeline runs in a separate thread and contains a
+            replicated set of the Analysis Engines running in the defined sequence</para>
+            </listitem>
+        </varlistentry>
+        
+        <varlistentry>
+          <term>Output Queue</term>
+          <listitem><para>holds batches of CASes with analysis results intended for CAS
+            Consumers</para></listitem>
+        </varlistentry>
+        
+        <varlistentry>
+          <term>CAS Consumers</term>
+          <listitem><para>perform collection level analysis over the CASes and extract
+            analysis results, e.g., creating indexes or databases</para></listitem>
+        </varlistentry>
+      </variablelist>
+      </para>
+  </section>
+  
+  <section id="&tp;notation">
+    <title>Notation</title>
+    
+    <para>CPE Descriptors are XML files. This chapter uses an informal notation to specify
+      the syntax of CPE Descriptors.</para>
+    
+    <para>The notation used in this chapter is:
+      
+      <itemizedlist><listitem><para>An ellipsis (...) inside an element body indicates
+        that the substructure of that element has been omitted (to be described in another
+        section of this chapter). An example of this would be:
+        
+        
+        <programlisting>&lt;collectionReader&gt;
+...
+&lt;/collectionReader&gt;</programlisting></para>
+        </listitem>
+        
+        <listitem><para>An ellipsis immediately after an element indicates that the
+          element type may be repeated arbitrarily many times. For example:
+          
+          
+          <programlisting>&lt;parameter&gt;[String]&lt;/parameter&gt;
+&lt;parameter&gt;[String]&lt;/parameter&gt;
+...</programlisting>
+          indicates that there may be arbitrarily many parameter elements in this
+          context.</para></listitem>
+        
+        <listitem><para>An ellipsis inside an element means details of the attributes
+          associated with that element are defined later, e.g.:
+          
+          <programlisting>&lt;casProcessor ...&gt;</programlisting></para>
+          </listitem>
+        
+        <listitem><para>Bracketed expressions (e.g. <literal>[String]</literal>)
+          indicate the type of value that may be used at that location.</para></listitem>
+        
+        <listitem><para>A vertical bar, as in <literal>true|false</literal>, indicates
+          alternatives. This can be applied to literal values, bracketed type names, and
+          elements. </para></listitem></itemizedlist></para>
+    
+    <para>Which elements are optional and which are required is specified in prose, not in the
+      syntax definition.</para>
+    
+  </section>
+  
+  <section id="&tp;imports">
+    <title>Imports</title>
+    
+    <para>As of version 2.2, a CPE Descriptor can use the same <literal>import</literal> mechanism
+      as other component descriptors.  This allows referring to component
+      descriptors using either relative paths (resolved relative to the location of the CPE descriptor)
+      or the classpath/datapath.  For details see <olink targetdoc="&uima_docs_ref;"
+      targetptr="ugr.ref.xml.component_descriptor"/>.</para>
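+    
+    <para>As a minimal sketch, a descriptor reference that uses <literal>import</literal> might
+      look like one of the following. The path and the dotted name are hypothetical placeholders,
+      not files shipped with UIMA. The <literal>location</literal> form is resolved relative to the
+      CPE descriptor, and the <literal>name</literal> form is looked up in the classpath/datapath:
+      
+      <programlisting><![CDATA[<!-- hypothetical relative path, resolved against the CPE descriptor -->
+<descriptor>
+    <import location="descriptors/MyAnnotator.xml"/>
+</descriptor>
+
+<!-- hypothetical dotted name, looked up in the classpath/datapath -->
+<descriptor>
+    <import name="org.example.descriptors.MyAnnotator"/>
+</descriptor>]]></programlisting></para>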
+     
+    <para>The following older syntax is still supported, but <emphasis>not recommended</emphasis>:
+      
+      <programlisting><![CDATA[<descriptor>
+    <include href="[URL or File]"/>
+</descriptor>]]></programlisting></para>
+    
+    <para>The <literal>[URL or File]</literal> value is a URL or a filename for the descriptor of the
+      incorporated component. An attempt is first made to resolve the value as a URL.</para>
+    
+    <para>
+      Relative paths in an <literal>include</literal> are resolved relative to the current working directory 
+      (NOT the CPE descriptor location as is the case for <literal>import</literal>). 
+      A filename relative to another directory can be specified using the <literal>CPM_HOME</literal>
+      variable, e.g.,    
+    <programlisting>&lt;descriptor&gt;
+    &lt;include href="${CPM_HOME}/desc_dir/descriptor.xml"/&gt;
+&lt;/descriptor&gt;</programlisting>
+    
+      In this case, the value for the <literal>CPM_HOME</literal> variable must be
+      provided to the CPE by specifying it on the Java command line, e.g.,
+        
+    <programlisting>java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</programlisting>
+    
+  </para>
+    
+  </section>
+  
+  <section id="&tp;descriptor">
+    <title>CPE Descriptor Overview</title>
+    
+    <para>A CPE Descriptor consists of information describing the following four main
+      elements.</para>
+    
+    <orderedlist><listitem><para>The <emphasis>Collection Reader</emphasis>, which
+      is responsible for gathering artifacts and initializing the Common Analysis
+      Structure (CAS) used to support processing in the UIMA collection processing
+      engine.</para></listitem>
+      
+      <listitem><para>The <emphasis>CAS Processors</emphasis>, responsible for
+        analyzing individual artifacts, analyzing across artifacts, and extracting
+        analysis results. CAS Processors include <emphasis>Analysis Engines</emphasis>
+        and <emphasis>CAS Consumers</emphasis>.</para></listitem>
+      
+      <listitem><para>Operational parameters of the <emphasis>Collection Processing
+        Manager</emphasis> (CPM), such as checkpoint frequency and deployment
+        mode.</para></listitem>
+      
+      <listitem><para>Resource Manager Configuration (optional). </para></listitem>
+      </orderedlist>
+    
+    <para>The CPE Descriptor has the following high level skeleton:
+      
+      
+      <programlisting><![CDATA[<?xml version="1.0"?>
+<cpeDescription>
+   <collectionReader>
+...
+   </collectionReader>
+   <casProcessors>
+...
+   </casProcessors>
+   <cpeConfig>
+...
+   </cpeConfig>
+   <resourceManagerConfiguration>
+...
+   </resourceManagerConfiguration>
+</cpeDescription>]]></programlisting></para>
+    
+    <para>Details of each of the four main elements are described in the sections that
+      follow.</para>
+ </section>   
+    <section id="&tp;descriptor.collection_reader">
+      <title>Collection Reader</title>
+      
+      <para>The <literal>&lt;collectionReader&gt;</literal> section identifies the
+        Collection Reader and optional CAS Initializer that are to be used in the CPE. The
+        Collection Reader is responsible for retrieval of artifacts from a collection
+        outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2)
+        is responsible for initializing the CAS with the artifact.</para>
+      
+      <para>A Collection Reader may initialize the CAS itself, in which case it does not
+        require a CAS Initializer. This should be clearly specified in the documentation for
+        the Collection Reader. Specifying a CAS Initializer for a Collection Reader that
+        does not make use of a CAS Initializer will not cause an error, but the specified CAS
+        Initializer will not be used.</para>
+      
+      <para>The complete structure of the <literal>&lt;collectionReader&gt;</literal>
+        section is:
+        
+        
+        <programlisting><![CDATA[<collectionReader>
+  <collectionIterator>
+    <descriptor>
+      <import ...> | <include .../>
+    </descriptor>
+    <configurationParameterSettings>...</configurationParameterSettings>
+    <sofaNameMappings>...</sofaNameMappings>
+  </collectionIterator>
+  <casInitializer>
+    <descriptor>
+      <import ...> | <include .../>
+    </descriptor>
+    <configurationParameterSettings>...</configurationParameterSettings>
+    <sofaNameMappings>...</sofaNameMappings>
+  </casInitializer>
+</collectionReader>]]></programlisting></para>
+      
+      <para>The <literal>&lt;collectionIterator&gt;</literal> identifies the
+        descriptor for the Collection Reader, and the
+        <literal>&lt;casInitializer&gt;</literal> identifies the descriptor for the
+        CAS Initializer. The format and
+        details of the Collection Reader and CAS Initializer descriptors are described in
+          <olink targetdoc="&uima_docs_ref;"
+          targetptr="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader"/>.
+        The <literal>&lt;configurationParameterSettings&gt;</literal> and the
+        <literal>&lt;sofaNameMappings&gt;</literal> elements are described in the next
+        section.</para>
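+      
+      <para>As an illustrative sketch, a <literal>&lt;collectionReader&gt;</literal> section for a
+        Collection Reader that initializes the CAS itself (so no CAS Initializer is given) might
+        look like the following. The descriptor path, the parameter name
+        <literal>InputDirectory</literal>, and its value are assumptions chosen for the example;
+        a real Collection Reader defines its own parameters:
+        
+        <programlisting><![CDATA[<collectionReader>
+  <collectionIterator>
+    <descriptor>
+      <!-- hypothetical path to a Collection Reader descriptor -->
+      <import location="descriptors/MyCollectionReader.xml"/>
+    </descriptor>
+    <configurationParameterSettings>
+      <nameValuePair>
+        <!-- hypothetical parameter; it must already be defined by the reader -->
+        <name>InputDirectory</name>
+        <value>
+          <string>/data/input</string>
+        </value>
+      </nameValuePair>
+    </configurationParameterSettings>
+  </collectionIterator>
+</collectionReader>]]></programlisting></para>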
+      
+      <section id="&tp;descriptor.collection_reader.error_handling">
+        <title>Error handling for Collection Readers</title>
+        
+        <para>The CPM will abort if the Collection Reader throws a large number of
+          consecutive exceptions (default = 100). This default can be changed by using the
+          Java initialization parameter <literal>-DMaxCRErrorThreshold
+          xxx.</literal></para>
+      </section>
+    </section>
+    
+    <section id="&tp;descriptor.cas_processors">
+      <title>CAS Processors</title>
+      
+      <para>The <literal>&lt;casProcessors&gt;</literal> section identifies the
+        components that perform the analysis on the input data, including CAS analysis
+        (Analysis Engines) and analysis results extraction (CAS Consumers). The CAS
+        Consumers may also perform collection level analysis, where the analysis is
+        performed (or aggregated) over multiple CASes. The basic structure of the CAS
+        Processors section is:
+        
+        
+        <programlisting><![CDATA[<casProcessors 
+    dropCasOnException="true|false"
+    casPoolSize="[Number]" 
+    processingUnitThreadCount="[Number]">
+
+  <casProcessor ...>
+        ...
+  </casProcessor>
+
+  <casProcessor ...>
+        ...
+  </casProcessor>
+    ...
+</casProcessors>]]></programlisting></para>
+      
+      <para>The <literal>&lt;casProcessors&gt;</literal> section has two mandatory
+        attributes and one optional attribute that configure the characteristics of the CAS
+        Processor flow in the CPE. The first mandatory attribute is
+        <literal>casPoolSize</literal>, which
+        defines the fixed number of CAS instances that the CPM will create and use during
+        processing. All CAS instances are maintained in a CAS Pool with check-in and
+        check-out access. Each CAS is checked out of the CAS Pool by the Collection Reader
+        and initialized with an initial subject of analysis. The CAS is checked back into the
+        CAS Pool when it is completely processed, at the end of the processing chain. A larger
+        CAS Pool size will result in more memory being used by the CPM. CAS objects can be large
+        and care should be taken to determine the optimum size of the CAS Pool, weighing memory
+        tradeoffs with performance.</para>
+      
+      <para>The second mandatory <literal>&lt;casProcessors&gt;</literal> attribute
+        is <literal>processingUnitThreadCount</literal>, which specifies the number of
+        replicated <emphasis>Processing Pipelines</emphasis>. Each Processing
+        Pipeline runs in its own thread. The CPM takes CASes from the work queue and submits
+        each CAS to one of the Processing Pipelines for analysis. A Processing Pipeline
+        contains one or more Analysis Engines invoked in a given sequence. If more than one
+        Processing Pipeline is specified, the CPM replicates instances of each Analysis
+        Engine defined in the CPE descriptor. Each Processing Pipeline thread runs
+        independently, consuming CASes from the work queue and depositing CASes with analysis
+        results onto the output queue. On multiprocessor machines, multiple Processing
+        Pipelines can run in parallel, improving overall throughput of the CPM.</para>
+      <note><para>The number of Processing Pipelines should be equal to or greater than CAS
+      Pool size. </para></note>
+      
+      <para>Elements in the pipeline (each represented by a &lt;casProcessor&gt; element)
+        may indicate that they do not permit multiple deployment in their Analysis Engine
+        descriptor. If so, even though multiple pipelines are being used, all CASes passing
+        through the pipelines will be routed through one instance of these marked Engines.
+        </para>
+      
+      <para>The final, optional, &lt;casProcessors&gt; attribute is
+        <literal>dropCasOnException</literal>. It defines a policy that determines what
+        happens with the CAS when an exception happens during processing. If the value of this
+        attribute is set to true and an exception happens, the CPM will notify all registered
+        listeners of the exception (see <olink targetdoc="&uima_docs_tutorial_guides;"
+          targetptr="ugr.tug.cpe.using_listeners"/>), clear the CAS and check the CAS
+        back into the CAS Pool so that it can be re-used. The presumption is that an exception
+        may leave the CAS in an inconsistent state and therefore that CAS should not be allowed
+        to move through the processing chain. When this attribute is omitted the CPM&apos;s
+        default is the same as specifying
+        <literal>dropCasOnException="false"</literal>.</para>
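+      
+      <para>For illustration, the three attributes described above could be combined as in the
+        following sketch. The numeric values are arbitrary example values, not recommendations,
+        and the contents of each <literal>&lt;casProcessor&gt;</literal> are omitted:
+        
+        <programlisting><![CDATA[<!-- example values only: two CAS instances, two pipelines,
+     and CASes are dropped when processing throws an exception -->
+<casProcessors casPoolSize="2"
+    processingUnitThreadCount="2"
+    dropCasOnException="true">
+
+  <casProcessor ...>
+        ...
+  </casProcessor>
+</casProcessors>]]></programlisting></para>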
+      
+      <section id="&tp;descriptor.cas_processors.individual">
+        <title>Specifying an Individual CAS Processor</title>
+        
+        <para>The CAS Processors that make up the Processing Pipeline and the CAS Consumer
+          pipeline are specified with the <literal>&lt;casProcessor&gt;</literal>
+          entity, which appears within the <literal>&lt;casProcessors&gt;</literal>
+          entity. It may appear multiple times, once for each CAS Processor specified for
+          this CPE.</para>
+        
+        <para>The order of the <literal>&lt;casProcessor&gt;</literal> entities within
+          the <literal>&lt;casProcessors&gt;</literal> section specifies the order in
+          which the CAS Processors will run. Although CAS Consumers are usually put at the end
+          of the pipeline, they need not be. Also, Aggregate Analysis Engines may include CAS
+          Consumers.</para>
+        
+        <para>The overall format of the <literal>&lt;casProcessor&gt;</literal> entity
+          is:
+          
+          
+          <programlisting><![CDATA[<casProcessor deployment="local|remote|integrated" name="[String]" >
+    <descriptor>
+      <import ...> | <include .../>
+    </descriptor>
+    <configurationParameterSettings>...</configurationParameterSettings>
+    <sofaNameMappings>...</sofaNameMappings>
+    <runInSeparateProcess>...</runInSeparateProcess>
+    <deploymentParameters>...</deploymentParameters>
+    <filter/>
+    <errorHandling>...</errorHandling>
+    <checkpoint batch="[Number]"/>
+</casProcessor>]]></programlisting></para>
+        
+        <para>The <literal>&lt;casProcessor&gt;</literal> element has two mandatory
+          attributes, <literal>deployment</literal> and <literal>name</literal>. The
+          mandatory <literal>name</literal> attribute specifies a unique string
+          identifying the CAS Processor.</para>
+        
+        <para>The mandatory <literal>deployment</literal> attribute specifies the CAS
+          Processor deployment mode. Currently, three deployment options are supported:
+          
+          <variablelist>
+            <varlistentry>
+              <term>integrated</term>
+              <listitem><para>indicates <emphasis>integrated</emphasis> deployment
+                of the CAS Processor. The CPM deploys and collocates the CAS Processor in the
+                same process space as the CPM. This type of deployment is recommended to
+                increase the performance of the CPE. However, it is NOT recommended to
+                deploy annotators containing JNI this way. Such CAS Processors may cause a
+                fatal exception and force the JVM to exit without cleanup (bringing down the
+                CPM). Any UIMA SDK compliant pure Java CAS Processors may be safely deployed
+                this way.</para>
+                <para>The descriptor for an integrated deployment can, in fact, be a remote
+                  service descriptor. When used this way, however, the CPM error recovery 
+                  options (see below) operate in the integrated mode, which means that many 
+                  of the retry options are not available.</para></listitem>
+            </varlistentry>
+            <varlistentry>
+              <term>remote</term>
+              <listitem><para>indicates <emphasis>non-managed</emphasis>
+                deployment of the CAS Processor. The CAS Processor descriptor referenced
+                in the <literal>&lt;descriptor&gt;</literal> element must be a Vinci
+                <emphasis>Service Client Descriptor</emphasis>, which identifies a
+                remotely deployed CAS Processor service (see <olink
+                  targetdoc="&uima_docs_tutorial_guides;"
+                  targetptr="ugr.tug.application.remote_services"/>). The CPM
+                assumes that the CAS Processor is already running as a remote service and
+                will connect to it using the URI provided in the client service descriptor.
+                The lifecycle of a remotely deployed CAS Processor is not managed by the CPM,
+                so appropriate infrastructure should be in place to start/restart such CAS
+                Processors when necessary. This deployment provides fault isolation and
+                is implementation (i.e., programming language) neutral.</para>
+                </listitem>
+            </varlistentry>
+            <varlistentry>
+              <term>local</term>
+              <listitem><para>indicates <emphasis>managed</emphasis> deployment of
+                the CAS Processor. The CAS Processor descriptor referenced in the
+                <literal>&lt;descriptor&gt;</literal> element must be a Vinci
+                <emphasis>Service Deployment Descriptor</emphasis>, which configures
+                a CAS Processor for deployment as a Vinci service (see <olink
+                  targetdoc="&uima_docs_tutorial_guides;"
+                  targetptr="ugr.tug.application.remote_services"/>). The CPM
+                deploys the CAS Processor in a separate process and manages the life cycle
+                (start/stop) of the CAS Processor. Communication between the CPM and the
+                CAS Processor is done with Vinci. When the CPM completes processing, the
+                process containing the CAS Processor is terminated. This deployment mode
+                insulates the CPM from the CAS Processor, creating a more robust deployment
+                at the cost of a small communication overhead. On multiprocessor machines,
+                the separate processes may run concurrently and improve overall
+                throughput.</para></listitem>
+            </varlistentry>
+          </variablelist></para>
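+        
+        <para>As a sketch of the most common case, an <literal>integrated</literal> CAS Processor
+          could be declared as follows. The <literal>name</literal> value and the descriptor path
+          are hypothetical, and the remaining child elements are elided because they are
+          described in the sections that follow:
+          
+          <programlisting><![CDATA[<!-- hypothetical name and path; runs in the same process as the CPM -->
+<casProcessor deployment="integrated" name="ExampleAnnotator">
+    <descriptor>
+      <import location="descriptors/ExampleAnnotator.xml"/>
+    </descriptor>
+    ...
+</casProcessor>]]></programlisting></para>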
+        
+        <para>A number of elements may appear within the
+          <literal>&lt;casProcessor&gt;</literal> element.</para>
+        
+        <section id="&tp;descriptor.cas_processors.individual.descriptor">
+          <title>&lt;descriptor&gt; Element</title>
+          
+          <para>The <literal>&lt;descriptor&gt;</literal> element is mandatory. It
+            identifies the descriptor for the referenced CAS Processor using the syntax
+            described in <olink targetdoc="&uima_docs_ref;"
+              targetptr="ugr.ref.xml.component_descriptor.aes"/>.
+            
+            <itemizedlist spacing="compact"><listitem><para>For
+              <emphasis><literal>remote</literal></emphasis> CAS Processors, the
+              referenced descriptor must be a Vinci <emphasis>Service Client
+              Descriptor</emphasis>, which identifies a remotely deployed CAS Processor
+              service.</para></listitem>
+              
+              <listitem><para>For <emphasis>local</emphasis> CAS Processors, the
+                referenced descriptor must be a Vinci <emphasis>Service Deployment
+                Descriptor</emphasis>.</para></listitem>
+              
+              <listitem><para>For <emphasis>integrated</emphasis> CAS Processors,
+                the referenced descriptor must be an Analysis Engine Descriptor
+                (primitive or aggregate). </para></listitem></itemizedlist> </para>
+          
+          <para>See <olink targetdoc="&uima_docs_tutorial_guides;"
+              targetptr="ugr.tug.application.remote_services"/> for more
+            information on creating these descriptors and deploying services.</para>
+          
+        </section>
+        
+        <section
+          id="&tp;descriptor.cas_processors.individual.configuration_parameter_settings">
+          <title>&lt;configurationParameterSettings&gt; Element</title>
+          
+          <para>This element provides a way to override the contained Analysis
+            Engine&apos;s parameter settings. Any entry specified here must already be
+            defined; values specified replace the corresponding values for each
+            parameter. <emphasis role="bold-italic">For CAS Processors, this mechanism
+            is only available when they are deployed in <quote>integrated</quote>
+            mode.</emphasis> For Collection Readers and CAS Initializers, it is always
+            available.</para>
+          
+          <para>The content of this element is identical to the component descriptor for
+            specifying parameters (in the case where no parameter groups are
+            specified)<footnote><para>An earlier UIMA version required these to have a
+            suffix of <quote>_p</quote>, e.g., <quote>string_p</quote>. The suffix is no
+            longer required, but it is still accepted for backward
+            compatibility.</para></footnote>. Here is an example:
+            
+            
+            <programlisting><![CDATA[<configurationParameterSettings>
+  <nameValuePair>
+    <name>CivilianTitles</name>
+    <value>
+      <array>
+        <string>Mr.</string>
+        <string>Ms.</string>
+        <string>Mrs.</string>
+        <string>Dr.</string>
+      </array>  
+    </value>
+  </nameValuePair>
+  ...
+</configurationParameterSettings>]]></programlisting></para>
+          
+        </section>
+        
+        <section
+          id="&tp;descriptor.cas_processors.individual.sofa_name_mappings">
+          <title>&lt;sofaNameMappings&gt; Element</title>
+          
+          <para>This optional element provides a mapping from defined Sofa names in the
+            component, or the default Sofa name (if the component does not declare any Sofa
+            names). The form of this element is:
+            
+            
+            <programlisting>&lt;sofaNameMappings&gt;
+  &lt;sofaNameMapping cpeSofaName="a_CPE_name"
+                   componentSofaName="a_component_Name"/&gt;
+  ...
+&lt;/sofaNameMappings&gt;</programlisting></para>
+          
+          <para>There can be any number of
+            <literal>&lt;sofaNameMapping&gt;</literal> elements contained in the
+            <literal>&lt;sofaNameMappings&gt;</literal> element. The
+            <literal>componentSofaName</literal> attribute is optional; leave it out to
+            specify a mapping for the <literal>_InitialView</literal> - that is, for
+            Single-View components.</para>
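+          
+          <para>The two forms might be used together as in the following sketch, where all Sofa
+            names are made up for the example. The first mapping omits
+            <literal>componentSofaName</literal> and therefore applies to the component&apos;s
+            <literal>_InitialView</literal>; the second maps a Sofa name that the component is
+            assumed to declare:
+            
+            <programlisting><![CDATA[<sofaNameMappings>
+  <!-- hypothetical CPE Sofa name mapped to the component's _InitialView -->
+  <sofaNameMapping cpeSofaName="EnglishDocument"/>
+  <!-- hypothetical mapping for a Sofa name declared by the component -->
+  <sofaNameMapping cpeSofaName="GermanDocument"
+                   componentSofaName="TranslatedText"/>
+</sofaNameMappings>]]></programlisting></para>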
+          
+        </section>
+        
+        <section id="&tp;descriptor.cas_processors.run_in_separate_process">
+          <title>&lt;runInSeparateProcess&gt; Element</title>
+          
+          <para>The <literal>&lt;runInSeparateProcess&gt;</literal> element is
+            mandatory for <literal>local</literal> CAS Processors, but should not appear
+            for <literal>remote</literal> or <literal>integrated</literal> CAS
+            Processors. It enables the CPM to create external processes using the provided
+            runtime environment. Applications launched this way communicate with the CPM
+            using the Vinci protocol and connectivity is enabled by a local instance of the
+            VNS that the CPM manages. Since communication is based on Vinci, the application
+            need not be implemented in Java. Any language for which Vinci provides support
+            may be used to create an application, and the CPM will seamlessly communicate
+            with it. The overall structure of this element is:
+            
+            
+            <programlisting><![CDATA[<runInSeparateProcess>
+    <exec dir="[String]" executable="[String]">
+        <env key="[String]" value ="[String]"/>
+        ...
+        <arg>[String]</arg>
+        ...
+    </exec>
+</runInSeparateProcess>]]></programlisting></para>
+          
+          <para>The <literal>&lt;exec&gt;</literal> element provides information
+            about how to execute the referenced CAS Processor. Two attributes are defined
+            for the <literal>&lt;exec&gt;</literal> element. The
+            <literal>dir</literal> attribute is currently not used &ndash; it is reserved
+            for future functionality. The <literal>executable</literal> attribute
+            specifies the actual Vinci service executable that will be run by the CPM, e.g.,
+            <literal>java</literal>, a batch script, an application (.exe), etc. The
+            executable must be specified with a fully qualified path, or be found in the
+            <literal>PATH</literal> of the CPM.</para>
+          
+          <para>The <literal>&lt;exec&gt;</literal> element has two elements within it
+            that define parameters used to construct the command line for executing the CAS
+            Processor. These elements must be listed in the order in which they should be
+            defined for the CAS Processor.</para>
+          
+          <para>The optional <literal>&lt;env&gt;</literal> element is used to set an
+            environment variable. The variable <literal>key</literal> will be set to
+            <literal>value</literal>. For example,
+            
+            
+            <programlisting>&lt;env key="CLASSPATH" value="C:Javalib"/&gt;</programlisting>
+            will set the environment variable <literal>CLASSPATH</literal> to the value
+            <literal>C:Javalib</literal>. The <literal>&lt;env&gt;</literal>
+            element may be repeated to set multiple environment variables. All of the
+            key/value pairs will be added to the environment by the CPM prior to launching the
+            executable.</para>
+          <note><para>The CPM actually adds ALL system environment variables when it
+          launches the program. It queries the Operating System for its current system
+          variables and one by one adds them to the program&apos;s process
+          configuration.</para></note>
+          
+          <para>The <literal>&lt;arg&gt;</literal> element is used to specify arbitrary
+            string arguments that will appear on the command line when the CPM runs the
+            command specified in the <literal>executable</literal> attribute.</para>
+          
+          <para>For example, the following would be used to invoke the UIMA Java
+            implementation of the Vinci service wrapper on a Java CAS Processor:
+            
+            
+            <programlisting><![CDATA[<runInSeparateProcess>
+    <exec executable="java">
+        <arg>-DVNS_HOST=localhost</arg> 
+        <arg>-DVNS_PORT=9099</arg>
+        <arg>org.apache.uima.reference_impl.analysis_engine.service.
+vinci.VinciAnalysisEngineService_impl</arg> 
+        <arg>C:uimadescdeployCasProcessor.xml</arg>
+    </exec>
+</runInSeparateProcess>]]></programlisting></para>
+          
+          <para>This will cause the CPM to run the following command line when starting the
+            CAS Processor:
+            
+            
+            <programlisting>java -DVNS_HOST=localhost -DVNS_PORT=9099 
+  org.apache.uima.reference_impl.analysis_engine.service.vinci.\\
+              VinciAnalysisEngineService_impl 
+  C:uimadescdeployCasProcessor.xml</programlisting></para>
+          
+          <para>The first argument specifies that the Vinci Naming Service is running on the
+            <literal>localhost</literal>. The second argument specifies that the Vinci
+            Naming Service port number is <literal>9099</literal>. The third argument
+            (split over 2 lines in this documentation) 
+            identifies the UIMA implementation of the Vinci service wrapper. This class
+            contains the <literal>main</literal> method that will execute. That main
+            method in turn takes a single argument &ndash; the filename for the CAS Processor
+            service deployment descriptor. Thus the last argument identifies the Vinci
+            service deployment descriptor file for the CAS Processor. Since this is the same
+            descriptor file specified earlier in the
+            <literal>&lt;descriptor&gt;</literal> element, the string
+            <literal>${descriptor}</literal> can be used to refer to the descriptor,
+            e.g.:
+            
+            
+            <programlisting>&lt;arg&gt;${descriptor}&lt;/arg&gt;</programlisting></para>
+          
+          <para>The CPM will expand this out to the service deployment descriptor file
+            referenced in the <literal>&lt;descriptor&gt;</literal> element.</para>
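+          
+          <para>As an illustration only, the pieces described above might be combined as
+            follows. This is a sketch under stated assumptions: the
+            <literal>CLASSPATH</literal> value is a hypothetical placeholder, and the VNS
+            host, port, and service wrapper class are simply carried over from the earlier
+            example.
+            
+            
+            <programlisting><![CDATA[<runInSeparateProcess>
+    <exec executable="java">
+        <!-- hypothetical value; replace with the classpath of your installation -->
+        <env key="CLASSPATH" value="/path/to/uima/lib/*"/>
+        <arg>-DVNS_HOST=localhost</arg>
+        <arg>-DVNS_PORT=9099</arg>
+        <arg>org.apache.uima.reference_impl.analysis_engine.service.vinci.VinciAnalysisEngineService_impl</arg>
+        <!-- expanded by the CPM to the file named in the <descriptor> element -->
+        <arg>${descriptor}</arg>
+    </exec>
+</runInSeparateProcess>]]></programlisting></para>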
+          
+        </section>
+        
+        <section
+          id="&tp;descriptor.cas_processors.individual.deployment_parameters">
+          <title>&lt;deploymentParameters&gt; Element</title>
+          
+          <para>The <literal>&lt;deploymentParameters&gt;</literal> element defines
+            a number of deployment parameters that control how the CPM will interact with the
+            CAS Processor. This element has the following overall form:
+            
+            
+            <programlisting>&lt;deploymentParameters&gt;
+    &lt;parameter name="[String]" value="..." type="string|integer" /&gt; 
+    ...
+&lt;/deploymentParameters&gt;</programlisting></para>
+          
+          <para>The <literal>name</literal> attribute identifies the parameter, the
+            <literal>value</literal> attribute specifies the value that will be assigned
+            to the parameter, and the <literal>type</literal> attribute indicates the
+            type of the parameter, either <literal>string</literal> or
+            <literal>integer</literal>. The available parameters include:
+            
+            <variablelist>
+              
+              <varlistentry>
+                <term>service-access</term>
+                <listitem><para>string parameter whose value must be
+                  <quote>exclusive</quote>, if present. This parameter is only
+                  effective for remote deployments. It modifies the Vinci service
+                  connections to be preallocated and dedicated, one service instance per
+                  pipeline. It is only relevant for non-integrated deployment modes. If
+                  fewer service instances are available (and alive &ndash;
+                  responding to a <quote>ping</quote> request) than there are pipelines,
+                  the number of pipelines (the number of concurrent threads) is reduced to
+                  match the number of available instances. If not specified, the VNS is
+                  queried each time a service is needed, and a <quote>random</quote>
+                  instance is assigned from the pool of available instances. If a service
+                  dies during processing, the CPM will use its normal error handling
+                  procedures to attempt to reconnect. The number of attempts is specified
+                  in the CPE descriptor for each CAS Processor using the
+                  <literal>&lt;maxConsecutiveRestarts value="10"
+                  action="kill-pipeline"
+                  waitTimeBetweenRetries="50"/&gt;</literal> xml element. The
+                  <quote>value</quote> attribute is the number of reconnection tries;
+                  the <quote>action</quote> says what to do if the retries exceed the
+                  limit. The <quote>kill-pipeline</quote> action stops the pipeline
+                  that was associated with the failing service (other pipelines will
+                  continue to work). The CAS in process within a killed pipeline will be
+                  dropped. These events are communicated to the application using the
+                  normal event listener mechanism. The
+                  <literal>waitTimeBetweenRetries</literal> attribute specifies how many
+                  milliseconds to wait between attempts to reconnect.</para>
+                  </listitem>
+              </varlistentry>
+              
+              <varlistentry>
+                <term>vnsHost</term>
+                <listitem><para>(Deprecated) string parameter specifying the VNS host,
+                  e.g., <literal>localhost</literal> for local CAS Processors, host
+                  name or IP address of VNS host for remote CAS Processors. This parameter is
+                  deprecated; use the parameter specification instead inside the Vinci
+                  <emphasis>Service Client Descriptor</emphasis>, if needed. It is
+                  ignored for integrated and local deployments. If present, for remote
+                  deployments, it specifies the VNS Host to use, unless that is specified in
+                  the Vinci <emphasis>Service Client Descriptor</emphasis>.</para>
+                  </listitem>
+              </varlistentry>
+              
+              <varlistentry>
+                <term>vnsPort</term>
+                <listitem><para>(Deprecated) integer parameter specifying the VNS port
+                  number. This parameter is deprecated; use the parameter specification
+                  instead inside the Vinci <emphasis>Service Client
+                  Descriptor,</emphasis> if needed. It is ignored for integrated and
+                  local deployments. If present, for remote deployments, it specifies the
+                  VNS Port number to use, unless that is specified in the Vinci
+                  <emphasis>Service Client Descriptor.</emphasis></para>
+                  </listitem>
+              </varlistentry>
+            </variablelist></para>
+          
+          <para>For example, the following parameters might be used with a CAS Processor
+            deployed in local mode:
+            
+            
+            <programlisting>&lt;deploymentParameters&gt;
+  &lt;parameter name="service-access" value="exclusive" type="string"/&gt; 
+&lt;/deploymentParameters&gt;</programlisting></para>
+          
+        </section>
+        
+        <section id="&tp;descriptor.cas_processors.individual.filter">
+          <title>&lt;filter&gt; Element</title>
+          
+          <para>The &lt;filter&gt; element is a required element but currently should be
+            left empty. This element is reserved for future use.</para>
+          
+        </section>
+        
+        <section id="&tp;descriptor.cas_processors.individual.error_handling">
+          <title>&lt;errorHandling&gt; Element</title>
+          
+          <para>The mandatory <literal>&lt;errorHandling&gt;</literal> element
+            defines error and restart policies for the CAS Processor. Each CAS Processor may
+            define different actions in the event of errors and restarts. The CPM monitors
+            and logs errant behaviors and attempts to recover the component based on the
+            policies specified in this element.</para>
+          
+          <para>There are two kinds of faults:
+            
+            <orderedlist><listitem><para>One kind only occurs with non-integrated CAS
+              Processors &ndash; this fault is either a timeout attempting to launch or
+              connect to the non-integrated component, or some other kind of connection
+              related exception (for instance, the network connection might timeout or get
+              reset).</para></listitem>
+              
+              <listitem><para>The other kind happens when the CAS Processor component (an
+                Annotator, for example) throws any kind of exception. This kind may occur
+                with any kind of deployment, integrated or not. </para></listitem>
+              </orderedlist></para>
+          
+          <para>The &lt;errorHandling&gt; has specifications for each of these kinds of
+            faults. The format of this element is:
+            
+            
+            <programlisting><![CDATA[<errorHandling>
+  <maxConsecutiveRestarts action="continue|disable|terminate"
+                           value="[Number]"/>
+  <errorRateThreshold action="continue|disable|terminate" value="[Rate]"/>
+  <timeout max="[Number]"/>
+</errorHandling>]]></programlisting></para>
+          
+          <para>The mandatory <literal>&lt;maxConsecutiveRestarts&gt;</literal>
+            element applies only to faults of the first kind, and therefore, only applies to
+            non-integrated deployments. If such a fault occurs, a retry is attempted, up to
+            <literal>value="[Number]"</literal> of times. This retry resets the
+            connection (if one was made) and attempts to reconnect and perhaps re-launch
+            (see below for details). The original CAS (not a partially updated one) is sent to
+            the CAS Processor as part of the retry, once the deployed component has been
+            successfully restarted or reconnected to.</para>
+          
+          <para>The <literal>action</literal> attribute specifies the action to take
+            when the threshold specified by the <literal>value="[Number]"</literal> is
+            exceeded. The possible actions are:
+            
+            <variablelist>
+              <varlistentry>
+                <term>continue</term>
+                <listitem><para>skip any further processing for this CAS by this CAS
+                  Processor, and pass the CAS to the next CAS Processor in the Pipeline.
+                  </para>
+                  <para>The <quote>restart</quote> action is done, because it is needed
+                    for the next CAS.</para>
+                  
+                  <para>If the <literal>dropCasOnException="true"</literal>, the CPM
+                    will NOT pass the CAS to the next CAS Processor in the chain. Instead, the
+                    CPM will abort processing of this CAS, release the CAS back to the CAS
+                    Pool and will process the next CAS in the queue.</para>
+                  
+                  <para>The counter counting the restarts toward the threshold is only
+                    reset after a CAS is successfully processed.</para></listitem>
+              </varlistentry>
+              
+              <varlistentry>
+                <term>disable</term>
+                <listitem><para>the current CAS is handled just as in the
+                  <literal>continue</literal> case, but in addition, the CAS Processor
+                  is marked so that its <emphasis>process()</emphasis> method will not be
+                  called again (i.e., it will be <quote>skipped</quote> for future
+                  CASes)</para></listitem>
+              </varlistentry>
+              
+              <varlistentry>
+                <term>terminate</term>
+                <listitem><para>the CPM will terminate all processing and exit.</para>
+                  </listitem>
+              </varlistentry>
+            </variablelist></para>
+          
+          <para>The definition of an error for the
+            <literal>&lt;maxConsecutiveRestarts&gt;</literal> element differs
+            slightly for each of the three CAS Processor deployment modes:
+            <variablelist>
+              <varlistentry>
+                <term>local</term>
+                <listitem><para>Local CAS Processors experience two general error
+                  types:
+                  <itemizedlist>
+                    <listitem><para>launch errors &ndash; errors associated with
+                      launching a process</para></listitem>
+                    <listitem><para>processing errors &ndash; errors associated with
+                      sending Vinci commands to the process</para></listitem>
+                  </itemizedlist></para>
+                  
+                  <para>A launch error is defined by a failure of the process to
+                    successfully register with the local VNS within a default time window.
+                    The current timeout is 15 minutes. Multiple local CAS Processors are
+                    launched sequentially, with a subsequent processor launched
+                    immediately after its previous processor successfully registers
+                    with the VNS.</para>
+                  
+                  <para>A processing error is detected if a connection to the CAS Processor
+                    is lost or if the processing time exceeds a specified timeout
+                    value.</para>
+                  
+                  <para>For local CAS Processors, the
+                    &lt;maxConsecutiveRestarts&gt; element specifies the number of
+                    consecutive attempts made to launch the CAS Processor at CPM startup or
+                    after the CPM has lost a connection to the CAS Processor.</para>
+                  </listitem>
+              </varlistentry>
+              
+              <varlistentry>
+                <term>remote</term>
+                <listitem><para>For remote CAS Processors, the
+                  &lt;maxConsecutiveRestarts&gt; element applies to errors from
+                  sending Vinci commands. An error is detected if a connection to the CAS
+                  Processor is lost, or if the processing time exceeds the timeout value
+                  specified in the &lt;timeout&gt; element (see below).</para>
+                  </listitem>
+              </varlistentry>
+              
+              <varlistentry>
+                <term>integrated</term>
+                <listitem><para>Although mandatory, the
+                  &lt;maxConsecutiveRestarts&gt; element is NOT used for integrated CAS
+                  Processors, because Integrated CAS Processors are not
+                  re-instantiated/restarted on exceptions. This setting is ignored by
+                  the CPM for Integrated CAS Processors but it is required. A future version
+                  of the CPM will make this element mandatory for remote and local CAS
+                  Processors only.</para></listitem>
+              </varlistentry>
+              
+            </variablelist></para>
+          
+          <para>The mandatory <literal>&lt;errorRateThreshold&gt;</literal> element
+            is used for all faults &ndash; both those above, and exceptions thrown by the CAS
+            Processor itself. It specifies the number of retries for exceptions thrown by
+            the CAS Processor itself, a maximum error rate, and the corresponding action to
+            take when this rate is exceeded. The <literal>value</literal> attribute
+            specifies the error rate in terms of errors per sample size in the form
+            <quote><literal>N/M</literal></quote>, where <literal>N</literal> is the
+            number of errors and <literal>M</literal> is the sample size, defined in terms
+            of the number of documents.</para>
+          
+          <para>The first number is used also to indicate the maximum number of retries. If
+            this number is less than the <literal>&lt;maxConsecutiveRestarts
+            value="[Number]"&gt;</literal>, it will override, reducing the number of
+            <quote>restarts</quote> attempted. A retry is done only if
+            <literal>dropCasOnException</literal> is false. If it is set to true, no retry
+            occurs, but the error is counted.</para>
+          
+          <para>When the number of errors counted within the current sample exceeds
+            <literal>N</literal>, the action
+            specified by the <literal>action</literal> attribute is taken. The possible
+            actions and their meaning are the same as described above for the
+            <literal>&lt;maxConsecutiveRestarts&gt;</literal> element:
+            <itemizedlist spacing="compact">
+              <listitem><para><literal>continue</literal></para></listitem>
+              <listitem><para><literal>disable</literal></para></listitem>
+              <listitem><para><literal>terminate</literal></para></listitem>
+            </itemizedlist></para>
+         
+          <para>The <literal>dropCasOnException="true"</literal> attribute of the
+            <literal>&lt;casProcessors&gt;</literal> element modifies the action
+            taken for continue and disable, in the same manner as above. For example:
+            
+            
+            <programlisting>&lt;errorRateThreshold value="3/1000" action="disable"/&gt;</programlisting>
+            specifies that each error thrown by the CAS Processor itself will be retried up to
+            3 times (if <literal>dropCasOnException</literal> is false) and the CAS
+            Processor will be disabled if the error rate exceeds 3 errors in 1000
+            documents.</para>
+          
+          <para>If a document causes an error and the error rate threshold for the CAS
+            Processor is not exceeded, the CPM increments the CAS Processor&apos;s error
+            count and retries processing that document (if
+            <literal>dropCasOnException</literal> is false). The retry means that the
+            CPM calls the CAS Processor&apos;s process() method again, passing in as an
+            argument the same CAS that previously caused an exception.</para>
+          <note><para>The CPM does not attempt to rollback any partial changes that may have
+          been applied to the CAS in the previous process() call. </para></note>
+          
+          <para>Errors are accumulated across documents. For example, assume the error
+            rate threshold is <literal>3/1000</literal>. The same document may fail three
+            times before finally succeeding on the fourth try, but the error count is now 3. If
+            one more error occurs within the current sample of 1000 documents, the error rate
+            threshold will be exceeded and the specified action will be taken. If no more
+            errors occur within the current sample, the error counter is reset to 0 for the
+            next sample of 1000 documents.</para>
+          
+          <para>The <literal>&lt;timeout&gt;</literal> element is a mandatory element.
+            Although mandatory for all CAS Processors, this element is only relevant for
+            local and remote CAS Processors. For integrated CAS Processors, this element is
+            ignored. In the current CPM implementation the integrated CAS Processor
+            process() method is not subject to timeouts.</para>
+          
+          <para>The <literal>max</literal> attribute specifies the maximum amount of
+            time in milliseconds the CPM will wait for a process() method to complete. When
+            exceeded, the CPM will generate an exception and will treat this as an error
+            subject to the threshold defined in the
+            <literal>&lt;errorRateThreshold&gt;</literal> element above, including
+            doing retries.</para>
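+          
+          <para>As a hedged illustration, a complete element for a non-integrated CAS
+            Processor might look like the following. The specific values are assumptions
+            chosen for the example, not recommended defaults:
+            
+            
+            <programlisting><![CDATA[<errorHandling>
+  <!-- up to 3 reconnect or relaunch attempts, then stop the CPM -->
+  <maxConsecutiveRestarts action="terminate" value="3"/>
+  <!-- disable the CAS Processor after more than 3 errors per 1000 documents -->
+  <errorRateThreshold action="disable" value="3/1000"/>
+  <!-- wait at most 100000 ms (100 seconds) for each process() call -->
+  <timeout max="100000"/>
+</errorHandling>]]></programlisting></para>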
+          
+          <section
+            id="&tp;descriptor.cas_processors.individual.error_handling.timeout_retry_action">
+            <title>Retry action taken on a timeout</title>
+            
+            <para>The action taken depends on whether the CAS Processor is local (managed)
+              or remote (unmanaged). Local CAS Processors (which are services) are killed
+              and restarted, and a new connection to them is established. For remote CAS
+              Processors, the connection to them is dropped, and a new connection is
+              reestablished (which may actually connect to a different instance of the
+              remote services, if it has multiple instances).</para>
+          </section>
+        </section>
+        
+        <section id="&tp;descriptor.cas_processors.individual.checkpoint">
+          <title>&lt;checkpoint&gt; Element</title>
+          
+          <para>The <literal>&lt;checkpoint&gt;</literal> element is an optional
+            element used to improve the performance of CAS Consumers. It has a single
+            attribute, <literal>batch</literal>, which specifies the number of CASes in a
+            batch, e.g.:
+            
+            
+            <programlisting>&lt;checkpoint batch="1000"/&gt;</programlisting></para>
+          
+          <para>This sets the batch size to 1000 CASes. The batch size is the interval used to mark a
+            point in processing requiring special handling. The CAS Processor&apos;s
+            <literal>batchProcessComplete()</literal> method will be called by the CPM
+            when this mark is reached so that the processor can take appropriate action. This
+            mark could be used as a mechanism to buffer up results in CAS Consumers and perform
+            time-consuming operations, such as check-pointing, that should not be done on a
+            per-document basis.</para>
+          
+        </section>
+      </section>
+    </section>
+    
+    <section id="&tp;descriptor.operational_parameters">
+      <title>CPE Operational Parameters</title>
+      
+      <para>The parameters for configuring the overall CPE and CPM are specified in the
+        <literal>&lt;cpeConfig&gt;</literal> section. The overall format of this
+        section is:
+        
+        
+        <programlisting><![CDATA[<cpeConfig>
+  <startAt>[NumberOrID]</startAt>
+
+  <numToProcess>[Number]</numToProcess>
+
+  <outputQueue dequeueTimeout="[Number]" queueClass="[ClassName]" />
+
+  <checkpoint file="[File]" time="[Number]" batch="[Number]"/>
+
+  <timerImpl>[ClassName]</timerImpl>
+
+  <deployAs>vinciService|interactive|immediate|single-threaded
+  </deployAs>
+
+</cpeConfig>]]></programlisting></para>
+      
+      <para>This section of the CPE descriptor allows for defining the starting entity, the
+        number of entities to process, a checkpoint file and frequency, a pluggable timer, an
+        optional output queue implementation, and finally a mode of operation. The mode of
+        operation determines how the CPM interacts with users and other systems.</para>
+      
+      <para>The <literal>&lt;startAt&gt;</literal> element is an optional argument. It
+        defines the starting entity in the collection at which the CPM should start
+        processing.</para>
+      
+      <para>The implementation in the CPM passes this argument to the Collection Reader
+        as the value of the parameter <quote><literal>startNumber</literal></quote>.
+        The CPM does not do anything else with this parameter; in particular, the CPM has no
+        ability to skip to a specific document - that function, if available, is only provided
+        by a particular Collection Reader implementation.</para>
+      
+      <para>If the <literal>&lt;startAt&gt;</literal> element is used, the Collection
+        Reader descriptor must define a single-valued configuration parameter with the
+        name <literal>startNumber</literal>. It can declare this value to be of any type;
+        the value passed in this XML element must be convertible to that type.</para>
+      
+      <para>A typical use is to declare this to be an integer type, and to pass the sequential
+        document number where processing should start. An alternative implementation
+        might take a specific document ID; the collection reader could search through its
+        collection until it reaches this ID and then start there.</para>
+      
+      <para>This parameter will only make sense if the particular collection reader is
+        implemented to use the <literal>startNumber</literal> configuration
+        parameter.</para>
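+      
+      <para>For instance, a Collection Reader descriptor could declare the parameter
+        along the following lines. This is a minimal sketch that assumes the reader
+        interprets the value as an integer document number; the description text is
+        illustrative, and the fragment belongs in the metadata section of the
+        reader&apos;s own descriptor:
+        
+        
+        <programlisting><![CDATA[<configurationParameters>
+  <configurationParameter>
+    <name>startNumber</name>
+    <description>Sequential number of the first document to process</description>
+    <type>Integer</type>
+    <multiValued>false</multiValued>
+    <mandatory>false</mandatory>
+  </configurationParameter>
+</configurationParameters>]]></programlisting></para>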
+      
+      <para>The <literal>&lt;numToProcess&gt;</literal> element is an optional
+        element. It specifies the total number of entities to process. Use -1 to indicate ALL.
+        If not defined, the number of entities to process will be taken from the Collection
+        Reader configuration. If present, this value overrides the Collection Reader
+        configuration.</para>
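+      
+      <para>The two elements described so far might be used together as in the
+        following fragment. The numbers are assumptions chosen for illustration, and
+        the remaining <literal>&lt;cpeConfig&gt;</literal> elements are omitted for
+        brevity:
+        
+        
+        <programlisting><![CDATA[<cpeConfig>
+  <!-- passed to the Collection Reader as its startNumber parameter -->
+  <startAt>1001</startAt>
+  <!-- process 500 entities; -1 would mean all of them -->
+  <numToProcess>500</numToProcess>
+  ...
+</cpeConfig>]]></programlisting></para>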
+      
+      <para>The <literal>&lt;outputQueue&gt;</literal> element is an optional element.
+        It enables plugging in a custom implementation for the Output Queue. When omitted,
+        the CPM will use a default output queue that is based on a first-in, first-out (FIFO)
+        model.</para>
+      
+      <para>The UIMA SDK provides a second implementation for the Output Queue that can be
+        plugged in to the CPM, named <quote>
+        <literal>org.apache.uima.collection.impl.cpm.engine.SequencedQueue</literal>
+        </quote>.</para>
+      
+      <para>This implementation supports handling very large documents that are split into
+        <quote>chunks</quote>; it provides a delivery mechanism that ensures the
+        sequential order of the chunks using information carried in the CAS metadata. This
+        metadata, which is required for this implementation to work correctly, must be added
+        as an instance of a Feature Structure of type
+        <literal>org.apache.es.tt.DocumentMetaData</literal> and referred to by an
+        additional feature named <literal>esDocumentMetaData</literal> in the special
+        instance of <literal>uima.tcas.DocumentAnnotation</literal> that is
+        associated with the CAS. This is usually done by the Collection Reader; the instance
+        contains the following features:
+        
+        <variablelist>
+          <varlistentry>
+            <term>sequenceNumber</term>
+            <listitem><para>[Number] the sequential number of a chunk, starting at 1. If
+              not a chunk (i.e., a complete document), the value should be 0.</para>
+              </listitem>
+          </varlistentry>
+          <varlistentry>
+            <term>documentId</term>
+            <listitem><para>[Number] current document id. Chunks belonging to the same
+              document have identical document id.</para></listitem>
+          </varlistentry>
+          <varlistentry>
+            <term>isCompleted</term>
+            <listitem><para>[Number] 1 if the chunk is the last in a sequence, 0
+              otherwise.</para></listitem>
+          </varlistentry>
+          <varlistentry>
+            <term>url</term>
+            <listitem><para>[String] document url.</para></listitem>
+          </varlistentry>
+          <varlistentry>
+            <term>throttleID</term>
+            <listitem><para>[String] special attribute currently used by
+              OmniFind.</para></listitem>
+          </varlistentry>
+        </variablelist></para>
+      
+      <para>This implementation of a sequenced queue supports proper sequencing of CASes in
+        CPM deployments that use document chunking. Chunking is a technique of splitting
+        large documents into pieces to reduce overall memory consumption. Chunking does not
+        depend on the number of CASes in the CAS Pool. It works equally well with one or more
+        CASes in the CAS Pool. Each chunk is packaged in a separate CAS and placed in the Work
+        Queue. If the CAS Pool is depleted, the CollectionReader thread is suspended until a
+        CAS is released back to the pool by the processing threads. A document may be split into
+        1, 2, 3 or more chunks that are analyzed independently. In order to reconstruct the
+        document correctly, the CAS Consumer can depend on receiving the chunks in the same
+        sequential order that the chunks were <quote>produced</quote>, when this
+        sequenced queue implementation is used. To plug in this sequenced queue to the CPM use
+        the following specification:
+        
+        
+        <programlisting>&lt;outputQueue dequeueTimeout="100000" queueClass=
+"org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/&gt;</programlisting>
+        
+        where the mandatory <literal>queueClass</literal> attribute defines the name of
+        the class and the second mandatory attribute, <literal>dequeueTimeout</literal>
+        specifies the maximum number of milliseconds to wait for the expected chunk.</para>
+      
+      <note><para>The value for this timeout must be carefully determined to avoid
+      excessive occurrences of timeouts. Typically, the size of a chunk and the type of
+      analysis being done are the most important factors when deciding on the value for the
+      timeout. The larger the chunk and the more complicated the analysis, the more time it takes
+      for the chunk to go from source to sink. You may specify 0, in which case, the timeout is 
+      disabled - i.e., it is equivalent to an infinitely long timeout.</para></note>
+      
+      <para>If the chunk doesn&apos;t arrive in the configured time window, the entire
+        document is presumed to be invalid and the CAS is dropped from further processing.
+        This action occurs regardless of any other error action specification. The
+        SequencedQueue invalidates the document, adding the offending document&apos;s
+        metadata to a local cache of invalid documents.</para>
+      
+      <para>If the timeout occurs, the CPM notifies all registered listeners (see <olink
+          targetdoc="&uima_docs_tutorial_guides;"
+          targetptr="ugr.tug.cpe.using_listeners"/>) by calling
+        entityProcessComplete(). As part of this call, the SequencedQueue passes null
+        instead of a CAS as the first argument, along with a special exception &ndash;
+        CPMChunkTimeoutException. Null is passed because the expected chunk was not
+        received within the configured timeout window, so there is no CAS available when
+        the timeout event occurs.</para>
+      
+      <para>The CPMChunkTimeoutException object includes an API that allows the listener
+        to retrieve the offending document id as well as the other metadata attributes as
+        defined above. These attributes are part of each chunk&apos;s metadata and are added
+        by the Collection Reader.</para>
+      
+      <para>Each chunk that SequencedQueue works on is subjected to a test to determine if the
+        chunk belongs to an invalid document. This test checks the chunk&apos;s metadata
+        against the data in the local cache. If there is a match, the chunk is dropped. This
+        check is performed only for chunks; complete documents are not subject to
+        it.</para>
+      
+      <para>If there is an exception during the processing of a chunk, the CPM sends a
+        notification to all registered listeners. The notification includes the CAS and an
+        exception. When the listener notification is completed, the CPM also sends separate
+        notifications, containing the CAS, to the Artifact Producer and the
+        SequencedQueue. The intent is to stop adding chunks that belong to an
+        <quote>invalid</quote> document to the Work Queue, and also to deal with chunks
+        that are en route, being processed by the processing threads.</para>
+      
+      <para>In response to the notification, the Artifact Producer will drop and release
+        back to the CAS Pool all CASes that belong to an <quote>invalid</quote> document.
+        Currently, there is no support in the CollectionReader&apos;s API to tell it to stop
+        generating chunks. The CollectionReader keeps producing the chunks but the
+        Artifact Producer immediately drops/releases them to the CAS Pool. Before the CAS is
+        released back to the CAS Pool, the Artifact Producer sends a notification to all
+        registered listeners. This notification includes the CAS and an exception &ndash;
+        SkipCasException.</para>
+      
+      <para>In response to the notification of an exception involving a chunk, the
+        SequencedQueue retrieves the metadata from the CAS and adds it to its local cache of
+        <quote>invalid</quote> documents. All chunks de-queued from the OutputQueue and
+        belonging to <quote>invalid</quote> documents will be dropped and released back to
+        the CAS Pool. Before dropping the CAS, the CPM sends a notification to all registered
+        listeners. The notification includes the CAS and a SkipCasException.</para>
+      
+      <para>The <literal>&lt;checkpoint&gt;</literal> element is optional.
+        It specifies a CPE checkpoint file, checkpoint frequency, and strategy for
+        checkpoints (time or count based). At checkpoint time, the CPM saves status
+        information and statistics to the checkpoint file. The checkpoint file is specified
+        in the <literal>file</literal> attribute, which has the same form as the
+        <literal>href</literal> attribute of the <literal>&lt;include&gt;</literal>
+        element described in <xref linkend="&tp;imports"/>. The
+        <literal>time</literal> attribute indicates that a checkpoint should be taken
+        every <literal>[Number]</literal> seconds, and the <literal>batch</literal>
+        attribute indicates that a checkpoint should be taken every
+        <literal>[Number]</literal> batches.</para>
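+      
+      <para>As a minimal sketch, a time-based and a count-based checkpoint might be
+        specified as shown below. The file name <literal>cpe_checkpoint.dat</literal> and
+        the numeric values are assumptions chosen only for illustration, and the two
+        elements are alternatives rather than a pair to be used together:
+        
+        
+        <programlisting><![CDATA[<!-- Hypothetical values: take a checkpoint every 300 seconds -->
+<checkpoint file="cpe_checkpoint.dat" time="300"/>
+
+<!-- Hypothetical alternative: take a checkpoint every 10 batches -->
+<checkpoint file="cpe_checkpoint.dat" batch="10"/>]]></programlisting></para>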
+      
+      <para>The <literal>&lt;timerImpl&gt;</literal> element is optional. It is used to
+        identify a custom timer plug-in class to generate time stamps during the CPM
+        execution. The value of the element is a Java class name.</para>
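+      
+      <para>For illustration, a custom timer could be plugged in as sketched below. The
+        class name <literal>org.example.cpe.MyTimer</literal> is hypothetical and stands
+        in for a user-supplied timer plug-in class:
+        
+        
+        <programlisting><![CDATA[<!-- Hypothetical timer plug-in class name -->
+<timerImpl>org.example.cpe.MyTimer</timerImpl>]]></programlisting></para>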
+      
+      <para>The <literal>&lt;deployAs&gt;</literal> element indicates the type of CPM
+        deployment. Valid contents for this element include:
+        
+        <variablelist>
+          <varlistentry>
+            <term>vinciService</term>
+            <listitem><para>Vinci service exposing APIs for stop, pause, resume, and
+              getStats</para></listitem>
+          </varlistentry>
+          <varlistentry>
+            <term>interactive</term>
+            <listitem><para>provide command line menus (start, stop, pause,
+              resume)</para></listitem>
+          </varlistentry>
+          <varlistentry>
+            <term>immediate</term>
+            <listitem><para>run the CPM without menus or a service API</para></listitem>
+          </varlistentry>
+          <varlistentry>
+            <term>single-threaded</term>
+            <listitem><para>run the CPM in single-threaded mode. In this mode, the
+              Collection Reader, the Processing Pipeline, and the CAS Consumer Pipeline
+              all run in one thread, without the work queue and the output
+              queue.</para></listitem>
+          </varlistentry>
+        </variablelist></para>
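+      
+      <para>For example, the following setting, shown only as a sketch, would select the
+        single-threaded deployment from the list above:
+        
+        
+        <programlisting><![CDATA[<deployAs>single-threaded</deployAs>]]></programlisting></para>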
+      
+    </section>
+    
+    <section id="&tp;descriptor.resource_manager_configuration">
+      <title>Resource Manager Configuration</title>
+      
+      <para>External resource bindings for the CPE may optionally be specified in an
+        element:
+        
+        
+        <programlisting>&lt;resourceManagerConfiguration href="..."/&gt;</programlisting></para>
+      
+      <para>For an introduction to external resources, refer to <olink
+          targetdoc="&uima_docs_tutorial_guides;"
+          targetptr="ugr.tug.aae.accessing_external_resource_files"/>.</para>
+      
+      <para>In the <literal>resourceManagerConfiguration</literal> element, the value
+        of the href attribute refers to another file that contains definitions and bindings
+        for the external resources used by the CPE. The format of this file is the same as the XML
+        snippet described in <olink targetdoc="&uima_docs_ref;"
+          targetptr="ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings"/>.
+        For example, in a CPE containing an aggregate analysis engine with two annotators,
+        and a CAS Consumer, the following resource manager configuration file would bind
+        external resource dependencies in all three components to the same physical
+        resource:
+        
+        
+        <programlisting><![CDATA[<resourceManagerConfiguration>
+
+  <!-- Declare Resource -->
+
+  <externalResources>
+    <externalResource>
+      <name>ExampleResource</name>
+      <fileResourceSpecifier>
+        <fileUrl>file:MyResourceFile.dat</fileUrl>
+      </fileResourceSpecifier>
+    </externalResource>
+  </externalResources>
+
+  <!-- Bind component resource dependencies to ExampleResource -->
+
+  <externalResourceBindings>
+    <externalResourceBinding>
+      <key>MyAE/annotator1/myResourceKey</key>
+      <resourceName>ExampleResource</resourceName>
+    </externalResourceBinding>
+
+    <externalResourceBinding>
+      <key>MyAE/annotator2/someResourceKey</key>
+      <resourceName>ExampleResource</resourceName>
+    </externalResourceBinding>
+
+    <externalResourceBinding>
+      <key>MyCasConsumer/otherResourceKey</key>
+      <resourceName>ExampleResource</resourceName>
+    </externalResourceBinding>
+
+  </externalResourceBindings>
+
+</resourceManagerConfiguration>]]></programlisting></para>
+      
+      <para>In this example, <literal>MyAE</literal> and
+        <literal>MyCasConsumer</literal> are the names of the Analysis Engine and CAS
+        Consumer, as specified by the name attributes of the CPE&apos;s
+        <literal>&lt;casProcessor&gt;</literal> elements.
+        <literal>annotator1</literal> and <literal>annotator2</literal> are the
+        annotator keys specified within the Aggregate AE Descriptor, and
+        <literal>myResourceKey</literal>, <literal>someResourceKey</literal>, and
+        <literal>otherResourceKey</literal> are the keys of the resource dependencies
+        declared in the individual annotator and CAS Consumer descriptors.</para>
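+      
+      <para>To make the mapping concrete, the sketch below shows CPE descriptor
+        <literal>&lt;casProcessor&gt;</literal> elements whose name attributes would match
+        the first segment of the binding keys in the example above. The descriptor
+        locations are hypothetical placeholders, and the other child elements of
+        <literal>&lt;casProcessor&gt;</literal> are omitted for brevity:
+        
+        
+        <programlisting><![CDATA[<casProcessors casPoolSize="1" processingUnitThreadCount="1">
+  <!-- name="MyAE" corresponds to the "MyAE/..." binding keys -->
+  <casProcessor deployment="integrated" name="MyAE">
+    <descriptor>
+      <!-- Hypothetical location of the aggregate AE descriptor -->
+      <import location="../analysis_engine/MyAggregateAE.xml"/>
+    </descriptor>
+  </casProcessor>
+  <!-- name="MyCasConsumer" corresponds to the "MyCasConsumer/..." binding key -->
+  <casProcessor deployment="integrated" name="MyCasConsumer">
+    <descriptor>
+      <!-- Hypothetical location of the CAS Consumer descriptor -->
+      <import location="../cas_consumer/MyCasConsumer.xml"/>
+    </descriptor>
+  </casProcessor>
+</casProcessors>]]></programlisting></para>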
+      
+    </section>
+    
+    <section id="&tp;descriptor.example">
+      <title>Example CPE Descriptor</title>
+      
+      
+      <programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
+<cpeDescription>
+  <collectionReader>
+    <collectionIterator>
+      <descriptor>
+        <import location=
+           "../collection_reader/FileSystemCollectionReader.xml"/>
+      </descriptor>
+    </collectionIterator>
+  </collectionReader>
+  <casProcessors dropCasOnException="true" casPoolSize="1" 
+      processingUnitThreadCount="1">
+    <casProcessor deployment="integrated" 
+      name="Aggregate TAE - Name Recognizer and Person Title Annotator">
+      <descriptor>
+        <import location=
+           "../analysis_engine/NamesAndPersonTitles_TAE.xml"/>
+      </descriptor>
+      <deploymentParameters/>
+      <filter/>
+      <errorHandling>
+        <errorRateThreshold action="terminate" value="100/1000"/>
+        <maxConsecutiveRestarts action="terminate" value="30"/>
+        <timeout max="100000"/>
+      </errorHandling>
+      <checkpoint batch="1"/>
+    </casProcessor>
+    <casProcessor deployment="integrated" name="Annotation Printer">
+      <descriptor>
+        <import location="../cas_consumer/AnnotationPrinter.xml"/>
+      </descriptor>
+      <deploymentParameters/>
+      <filter/>
+      <errorHandling>
+        <errorRateThreshold action="terminate" value="100/1000"/>
+        <maxConsecutiveRestarts action="terminate" value="30"/>
+        <timeout max="100000"/>
+      </errorHandling>
+      <checkpoint batch="1"/>
+    </casProcessor>
+  </casProcessors>
+  <cpeConfig>
+    <numToProcess>1</numToProcess>
+    <deployAs>immediate</deployAs>
+    <checkpoint file="" time="3000"/>
+    <timerImpl/>
+  </cpeConfig>
+</cpeDescription>]]></programlisting>
+    </section>
+  
+</chapter>
\ No newline at end of file

Added: uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/references.xml
URL: http://svn.apache.org/viewvc/uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/references.xml?rev=941739&view=auto
==============================================================================
--- uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/references.xml (added)
+++ uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/references.xml Thu May  6 14:01:56 2010
@@ -0,0 +1,35 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd">
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements.  See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership.  The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License.  You may obtain a copy of the License at
+
+   http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied.  See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<book lang="en">
+  <title>UIMA References</title>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../target/docbook-shared/common_book_info_ibm_c.xml"/>
+
+  <toc/>
+  
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.javadocs.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.xml.component_descriptor.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.xml.cpe_descriptor.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.cas.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.jcas.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.pear.xml"/>
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.xmi.xml"/>
+</book>