Posted to commits@uima.apache.org by sc...@apache.org on 2010/05/06 16:01:57 UTC
svn commit: r941739 [5/5] - in
/uima/uimaj/branches/mavenAlign/uima-docbook-references: ./ src/
src/docbook/ src/docbook/images/ src/docbook/images/references/
src/docbook/images/references/ref.cas/
src/docbook/images/references/ref.javadocs/ src/docbo...
Added: uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/ref.xml.cpe_descriptor.xml
URL: http://svn.apache.org/viewvc/uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/ref.xml.cpe_descriptor.xml?rev=941739&view=auto
==============================================================================
--- uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/ref.xml.cpe_descriptor.xml (added)
+++ uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/ref.xml.cpe_descriptor.xml Thu May 6 14:01:56 2010
@@ -0,0 +1,1368 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+<!ENTITY imgroot "images/references/ref.xml.cpe_descriptor/">
+<!ENTITY tp "ugr.ref.xml.cpe_descriptor.">
+<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
+%uimaents;
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<chapter id="ugr.ref.xml.cpe_descriptor">
+ <title>Collection Processing Engine Descriptor Reference</title>
+ <titleabbrev>CPE Descriptor Reference</titleabbrev>
+
+ <para>A UIMA <emphasis>Collection Processing Engine</emphasis> (CPE) is a combination
+ of UIMA components assembled to analyze a collection of artifacts. A CPE is an
+ instantiation of the UIMA <emphasis>Collection Processing Architecture</emphasis>,
+ which defines the collection processing components, interfaces, and APIs. A CPE is
+ executed by a UIMA framework component called the <emphasis>Collection Processing
+ Manager</emphasis> (CPM), which provides a number of services for deploying CPEs,
+ running CPEs, and handling errors.</para>
+
+ <para>A CPE can be assembled programmatically within a Java application, or it can be
+ assembled declaratively via a CPE configuration specification, called a CPE
+ Descriptor. This chapter describes the format of the CPE Descriptor.</para>
+
+ <para>Details about the CPE, including its function, sub-components, APIs, and related
+ tools, can be found in <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.cpe"/>. Here we briefly summarize the CPE to define terms and
+ provide context for the later sections that describe the CPE Descriptor.</para>
+
+ <section id="&tp;overview">
+ <title>CPE Overview</title>
+
+ <figure id="&tp;overview.fig.runtime">
+ <title>CPE Runtime Overview</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.8in" format="PNG"
+ fileref="&imgroot;image002.png"/>
+ </imageobject>
+ <textobject><phrase>CPE Runtime Overview diagram</phrase></textobject>
+ </mediaobject>
+ </figure>
+
+ <para>An illustration of the CPE runtime is shown in <xref
+ linkend="&tp;overview.fig.runtime"/>. Some of the CPE components, such as the
+ <emphasis>queues</emphasis> and <emphasis>processing pipelines</emphasis>, are
+ internal to the CPE, but their behavior and deployment may be configured using the CPE
+ Descriptor. Other CPE components, such as the <emphasis>Collection
+ Reader</emphasis> and <emphasis>CAS Processors</emphasis>, are defined and
+ configured externally from the CPE and then plugged in to the CPE to create the overall
+ engine. The parts of a CPE are:
+
+ <variablelist>
+ <varlistentry>
+ <term>Collection Reader</term>
+ <listitem><para>understands the native data collection format and iterates
+ over the collection producing subjects of analysis</para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>CAS Initializer<footnote><para>Deprecated</para></footnote>
+ </term>
+ <listitem><para>initializes a CAS with a subject of analysis</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Artifact Producer</term>
+ <listitem><para>asynchronously pulls CASes from the Collection Reader,
+ creates batches of CASes and puts them into the work queue</para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Work Queue</term>
+ <listitem><para>shared queue containing batches of CASes queued by the Artifact
+ Producer for analysis by Analysis Engines</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>B1-Bn</term>
+ <listitem><para>individual batches containing 1 or more CASes</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>AE1-AEn</term>
+ <listitem><para>Analysis Engines arranged by a CPE descriptor</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Processing Pipelines</term>
+ <listitem><para>each pipeline runs in a separate thread and contains a
+ replicated set of the Analysis Engines running in the defined sequence</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Output Queue</term>
+ <listitem><para>holds batches of CASes with analysis results intended for CAS
+ Consumers</para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>CAS Consumers</term>
+ <listitem><para>perform collection level analysis over the CASes and extract
+ analysis results, e.g., creating indexes or databases</para></listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </section>
+
+ <section id="&tp;notation">
+ <title>Notation</title>
+
+ <para>CPE Descriptors are XML files. This chapter uses an informal notation to specify
+ the syntax of CPE Descriptors.</para>
+
+ <para>The notation used in this chapter is:
+
+ <itemizedlist><listitem><para>An ellipsis (...) inside an element body indicates
+ that the substructure of that element has been omitted (to be described in another
+ section of this chapter). An example of this would be:
+
+
+ <programlisting><collectionReader>
+...
+</collectionReader></programlisting></para>
+ </listitem>
+
+ <listitem><para>An ellipsis immediately after an element indicates that the
+ element type may be repeated arbitrarily many times. For example:
+
+
+ <programlisting><parameter>[String]</parameter>
+<parameter>[String]</parameter>
+...</programlisting>
+ indicates that there may be arbitrarily many parameter elements in this
+ context.</para></listitem>
+
+ <listitem><para>An ellipsis inside an element's start tag means that the attributes
+ associated with that element are defined later, e.g.:
+
+ <programlisting><casProcessor ...></programlisting></para>
+ </listitem>
+
+ <listitem><para>Bracketed expressions (e.g. <literal>[String]</literal>)
+ indicate the type of value that may be used at that location.</para></listitem>
+
+ <listitem><para>A vertical bar, as in <literal>true|false</literal>, indicates
+ alternatives. This can be applied to literal values, bracketed type names, and
+ elements. </para></listitem></itemizedlist></para>
+
+ <para>Which elements are optional and which are required is specified in prose, not in the
+ syntax definition.</para>
+
+ </section>
+
+ <section id="&tp;imports">
+ <title>Imports</title>
+
+ <para>As of version 2.2, a CPE Descriptor can use the same <literal>import</literal> mechanism
+ as other component descriptors. This allows referring to component
+ descriptors using either relative paths (resolved relative to the location of the CPE descriptor)
+ or the classpath/datapath. For details see <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor"/>.</para>
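+
+ <para>As a minimal sketch of the two <literal>import</literal> forms (the file path and the
+ descriptor name below are hypothetical, chosen only for illustration), a component
+ descriptor could be referenced either by location or by name:
+
+ <programlisting><![CDATA[<!-- by location: resolved relative to the CPE descriptor -->
+<descriptor>
+  <import location="descriptors/ExampleCollectionReader.xml"/>
+</descriptor>
+
+<!-- by name: looked up through the classpath/datapath -->
+<descriptor>
+  <import name="org.example.descriptors.ExampleCollectionReader"/>
+</descriptor>]]></programlisting></para>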
+
+ <para>The following older syntax is still supported, but <emphasis>not recommended</emphasis>:
+
+ <programlisting><![CDATA[<descriptor>
+ <include href="[URL or File]"/>
+</descriptor>]]></programlisting></para>
+
+ <para>The <literal>[URL or File]</literal> value of the <literal>href</literal> attribute is a URL or a filename for the descriptor of the
+ incorporated component. An attempt is first made to resolve the value as a URL.</para>
+
+ <para>
+ Relative paths in an <literal>include</literal> are resolved relative to the current working directory
+ (NOT the CPE descriptor location as is the case for <literal>import</literal>).
+ A filename relative to another directory can be specified using the <literal>CPM_HOME</literal>
+ variable, e.g.,
+ <programlisting><descriptor>
+ <include href="${CPM_HOME}/desc_dir/descriptor.xml"/>
+</descriptor></programlisting>
+
+ In this case, the value for the <literal>CPM_HOME</literal> variable must be
+ provided to the CPE by specifying it on the Java command line, e.g.,
+
+ <programlisting>java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</programlisting>
+
+ </para>
+
+ </section>
+
+ <section id="&tp;descriptor">
+ <title>CPE Descriptor Overview</title>
+
+ <para>A CPE Descriptor consists of information describing the following four main
+ elements.</para>
+
+ <orderedlist><listitem><para>The <emphasis>Collection Reader</emphasis>, which
+ is responsible for gathering artifacts and initializing the Common Analysis
+ Structure (CAS) used to support processing in the UIMA collection processing
+ engine.</para></listitem>
+
+ <listitem><para>The <emphasis>CAS Processors</emphasis>, responsible for
+ analyzing individual artifacts, analyzing across artifacts, and extracting
+ analysis results. CAS Processors include <emphasis>Analysis Engines</emphasis>
+ and <emphasis>CAS Consumers</emphasis>.</para></listitem>
+
+ <listitem><para>Operational parameters of the <emphasis>Collection Processing
+ Manager</emphasis> (CPM), such as checkpoint frequency and deployment
+ mode.</para></listitem>
+
+ <listitem><para>Resource Manager Configuration (optional). </para></listitem>
+ </orderedlist>
+
+ <para>The CPE Descriptor has the following high level skeleton:
+
+
+ <programlisting><![CDATA[<?xml version="1.0"?>
+<cpeDescription>
+ <collectionReader>
+...
+ </collectionReader>
+ <casProcessors>
+...
+ </casProcessors>
+ <cpeConfig>
+...
+ </cpeConfig>
+ <resourceManagerConfiguration>
+...
+ </resourceManagerConfiguration>
+</cpeDescription>]]></programlisting></para>
+
+ <para>Details of each of the four main elements are described in the sections that
+ follow.</para>
+ </section>
+ <section id="&tp;descriptor.collection_reader">
+ <title>Collection Reader</title>
+
+ <para>The <literal><collectionReader></literal> section identifies the
+ Collection Reader and optional CAS Initializer that are to be used in the CPE. The
+ Collection Reader is responsible for retrieval of artifacts from a collection
+ outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2)
+ is responsible for initializing the CAS with the artifact.</para>
+
+ <para>A Collection Reader may initialize the CAS itself, in which case it does not
+ require a CAS Initializer. This should be clearly specified in the documentation for
+ the Collection Reader. Specifying a CAS Initializer for a Collection Reader that
+ does not make use of a CAS Initializer will not cause an error, but the specified CAS
+ Initializer will not be used.</para>
+
+ <para>The complete structure of the <literal><collectionReader></literal>
+ section is:
+
+
+ <programlisting><![CDATA[<collectionReader>
+ <collectionIterator>
+ <descriptor>
+ <import ...> | <include .../>
+ </descriptor>
+ <configurationParameterSettings>...</configurationParameterSettings>
+ <sofaNameMappings>...</sofaNameMappings>
+ </collectionIterator>
+ <casInitializer>
+ <descriptor>
+ <import ...> | <include .../>
+ </descriptor>
+ <configurationParameterSettings>...</configurationParameterSettings>
+ <sofaNameMappings>...</sofaNameMappings>
+ </casInitializer>
+</collectionReader>]]></programlisting></para>
+
+ <para>The <literal><collectionIterator></literal> identifies the
+ descriptor for the Collection Reader, and the <literal><casInitializer></literal>
+ identifies the descriptor for the CAS Initializer. The format and
+ details of the Collection Reader and CAS Initializer descriptors are described in
+ <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader"/>.
+ The <literal><configurationParameterSettings></literal> and the
+ <literal><sofaNameMappings></literal> elements are described in the next
+ section.</para>
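+
+ <para>For illustration, a Collection Reader that initializes the CAS itself (and so needs
+ no <literal><casInitializer></literal>) might be declared as sketched below. The
+ descriptor location, the <literal>InputDirectory</literal> parameter, and its value are
+ assumptions made for this example; the parameter must already be defined in the
+ referenced Collection Reader descriptor:
+
+ <programlisting><![CDATA[<collectionReader>
+  <collectionIterator>
+    <descriptor>
+      <!-- hypothetical path, resolved relative to the CPE descriptor -->
+      <import location="descriptors/ExampleCollectionReader.xml"/>
+    </descriptor>
+    <configurationParameterSettings>
+      <nameValuePair>
+        <!-- hypothetical parameter name and value -->
+        <name>InputDirectory</name>
+        <value>
+          <string>data/input</string>
+        </value>
+      </nameValuePair>
+    </configurationParameterSettings>
+  </collectionIterator>
+</collectionReader>]]></programlisting></para>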
+
+ <section id="&tp;descriptor.collection_reader.error_handling">
+ <title>Error handling for Collection Readers</title>
+
+ <para>The CPM will abort if the Collection Reader throws a large number of
+ consecutive exceptions (default = 100). This default can be changed by using the
+ Java initialization parameter <literal>-DMaxCRErrorThreshold
+ xxx</literal>.</para>
+ </section>
+ </section>
+
+ <section id="&tp;descriptor.cas_processors">
+ <title>CAS Processors</title>
+
+ <para>The <literal><casProcessors></literal> section identifies the
+ components that perform the analysis on the input data, including CAS analysis
+ (Analysis Engines) and analysis results extraction (CAS Consumers). The CAS
+ Consumers may also perform collection level analysis, where the analysis is
+ performed (or aggregated) over multiple CASes. The basic structure of the CAS
+ Processors section is:
+
+
+ <programlisting><![CDATA[<casProcessors
+ dropCasOnException="true|false"
+ casPoolSize="[Number]"
+ processingUnitThreadCount="[Number]">
+
+ <casProcessor ...>
+ ...
+ </casProcessor>
+
+ <casProcessor ...>
+ ...
+ </casProcessor>
+ ...
+</casProcessors>]]></programlisting></para>
+
+ <para>The <literal><casProcessors></literal> section has two mandatory
+ attributes and one optional attribute that configure the characteristics of the CAS
+ Processor flow in the CPE. The first mandatory attribute is <literal>casPoolSize</literal>, which
+ defines the fixed number of CAS instances that the CPM will create and use during
+ processing. All CAS instances are maintained in a CAS Pool with check-in and
+ check-out access. Each CAS is checked out of the CAS Pool by the Collection Reader
+ and initialized with an initial subject of analysis. The CAS is checked back into the
+ CAS Pool when it is completely processed, at the end of the processing chain. A larger
+ CAS Pool size will result in more memory being used by the CPM. CAS objects can be large
+ and care should be taken to determine the optimum size of the CAS Pool, weighing memory
+ tradeoffs with performance.</para>
+
+ <para>The second mandatory <literal><casProcessors></literal> attribute
+ is <literal>processingUnitThreadCount</literal>, which specifies the number of
+ replicated <emphasis>Processing Pipelines</emphasis>. Each Processing
+ Pipeline runs in its own thread. The CPM takes CASes from the work queue and submits
+ each CAS to one of the Processing Pipelines for analysis. A Processing Pipeline
+ contains one or more Analysis Engines invoked in a given sequence. If more than one
+ Processing Pipeline is specified, the CPM replicates instances of each Analysis
+ Engine defined in the CPE descriptor. Each Processing Pipeline thread runs
+ independently, consuming CASes from the work queue and depositing CASes with analysis
+ results onto the output queue. On multiprocessor machines, multiple Processing
+ Pipelines can run in parallel, improving overall throughput of the CPM.</para>
+ <note><para>The number of Processing Pipelines should be equal to or greater than CAS
+ Pool size. </para></note>
+
+ <para>Elements in the pipeline (each represented by a <literal><casProcessor></literal> element)
+ may indicate that they do not permit multiple deployment in their Analysis Engine
+ descriptor. If so, even though multiple pipelines are being used, all CASes passing
+ through the pipelines will be routed through one instance of these marked Engines.
+ </para>
+
+ <para>The final, optional, <literal><casProcessors></literal> attribute is
+ <literal>dropCasOnException</literal>. It defines a policy that determines what
+ happens with the CAS when an exception happens during processing. If the value of this
+ attribute is set to true and an exception happens, the CPM will notify all registered
+ listeners of the exception (see <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.cpe.using_listeners"/>), clear the CAS and check the CAS
+ back into the CAS Pool so that it can be re-used. The presumption is that an exception
+ may leave the CAS in an inconsistent state and therefore that CAS should not be allowed
+ to move through the processing chain. When this attribute is omitted the CPM's
+ default is the same as specifying
+ <literal>dropCasOnException="false"</literal>.</para>
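+
+ <para>As a sketch of how the three attributes fit together (the values are
+ illustrative assumptions, not recommendations), the following declares two
+ Processing Pipelines sharing a pool of two CASes, and drops any CAS whose
+ processing raised an exception:
+
+ <programlisting><![CDATA[<casProcessors
+    dropCasOnException="true"
+    casPoolSize="2"
+    processingUnitThreadCount="2">
+  <casProcessor ...>
+    ...
+  </casProcessor>
+</casProcessors>]]></programlisting></para>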
+
+ <section id="&tp;descriptor.cas_processors.individual">
+ <title>Specifying an Individual CAS Processor</title>
+
+ <para>The CAS Processors that make up the Processing Pipeline and the CAS Consumer
+ pipeline are specified with the <literal><casProcessor></literal>
+ element, which appears within the <literal><casProcessors></literal>
+ element. It may appear multiple times, once for each CAS Processor specified for
+ this CPE.</para>
+
+ <para>The order of the <literal><casProcessor></literal> elements within
+ the <literal><casProcessors></literal> section specifies the order in
+ which the CAS Processors will run. Although CAS Consumers are usually put at the end
+ of the pipeline, they need not be. Also, Aggregate Analysis Engines may include CAS
+ Consumers.</para>
+
+ <para>The overall format of the <literal><casProcessor></literal> element
+ is:
+
+
+ <programlisting><![CDATA[<casProcessor deployment="local|remote|integrated" name="[String]" >
+ <descriptor>
+ <import ...> | <include .../>
+ </descriptor>
+ <configurationParameterSettings>...</configurationParameterSettings>
+ <sofaNameMappings>...</sofaNameMappings>
+ <runInSeparateProcess>...</runInSeparateProcess>
+ <deploymentParameters>...</deploymentParameters>
+ <filter/>
+ <errorHandling>...</errorHandling>
+ <checkpoint batch="[Number]"/>
+</casProcessor>]]></programlisting></para>
+
+ <para>The <literal><casProcessor></literal> element has two mandatory
+ attributes, <literal>deployment</literal> and <literal>name</literal>. The
+ mandatory <literal>name</literal> attribute specifies a unique string
+ identifying the CAS Processor.</para>
+
+ <para>The mandatory <literal>deployment</literal> attribute specifies the CAS
+ Processor deployment mode. Currently, three deployment options are supported:
+
+ <variablelist>
+ <varlistentry>
+ <term>integrated</term>
+ <listitem><para>indicates <emphasis>integrated</emphasis> deployment
+ of the CAS Processor. The CPM deploys and collocates the CAS Processor in the
+ same process space as the CPM. This type of deployment is recommended to
+ increase the performance of the CPE. However, it is NOT recommended to
+ deploy annotators containing JNI this way. Such CAS Processors may cause a
+ fatal exception and force the JVM to exit without cleanup (bringing down the
+ CPM). Any UIMA SDK compliant pure Java CAS Processors may be safely deployed
+ this way.</para>
+ <para>The descriptor for an integrated deployment can, in fact, be a remote
+ service descriptor. When used this way, however, the CPM error recovery
+ options (see below) operate in the integrated mode, which means that many
+ of the retry options are not available.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>remote</term>
+ <listitem><para>indicates <emphasis>non-managed</emphasis>
+ deployment of the CAS Processor. The CAS Processor descriptor referenced
+ in the <literal><descriptor></literal> element must be a Vinci
+ <emphasis>Service Client Descriptor</emphasis>, which identifies a
+ remotely deployed CAS Processor service (see <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.application.remote_services"/>). The CPM
+ assumes that the CAS Processor is already running as a remote service and
+ will connect to it using the URI provided in the client service descriptor.
+ The lifecycle of a remotely deployed CAS Processor is not managed by the CPM,
+ so appropriate infrastructure should be in place to start/restart such CAS
+ Processors when necessary. This deployment provides fault isolation and
+ is implementation (i.e., programming language) neutral.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>local</term>
+ <listitem><para>indicates <emphasis>managed</emphasis> deployment of
+ the CAS Processor. The CAS Processor descriptor referenced in the
+ <literal><descriptor></literal> element must be a Vinci
+ <emphasis>Service Deployment Descriptor</emphasis>, which configures
+ a CAS Processor for deployment as a Vinci service (see <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.application.remote_services"/>). The CPM
+ deploys the CAS Processor in a separate process and manages the life cycle
+ (start/stop) of the CAS Processor. Communication between the CPM and the
+ CAS Processor is done with Vinci. When the CPM completes processing, the
+ process containing the CAS Processor is terminated. This deployment mode
+ insulates the CPM from the CAS Processor, creating a more robust deployment
+ at the cost of a small communication overhead. On multiprocessor machines,
+ the separate processes may run concurrently and improve overall
+ throughput.</para></listitem>
+ </varlistentry>
+ </variablelist></para>
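+
+ <para>As a minimal sketch of the most common case, an
+ <literal>integrated</literal> CAS Processor referencing an Analysis Engine
+ Descriptor might look like the following. The processor name and descriptor
+ location are hypothetical, and the ellipsis stands for the remaining child
+ elements described in the sections below:
+
+ <programlisting><![CDATA[<casProcessor deployment="integrated" name="Example Annotator">
+  <descriptor>
+    <!-- hypothetical Analysis Engine Descriptor -->
+    <import location="descriptors/ExampleAnnotator.xml"/>
+  </descriptor>
+  ...
+</casProcessor>]]></programlisting></para>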
+
+ <para>A number of elements may appear within the
+ <literal><casProcessor></literal> element.</para>
+
+ <section id="&tp;descriptor.cas_processors.individual.descriptor">
+ <title><descriptor> Element</title>
+
+ <para>The <literal><descriptor></literal> element is mandatory. It
+ identifies the descriptor for the referenced CAS Processor using the syntax
+ described in <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor.aes"/>.
+
+ <itemizedlist spacing="compact"><listitem><para>For
+ <emphasis><literal>remote</literal></emphasis> CAS Processors, the
+ referenced descriptor must be a Vinci <emphasis>Service Client
+ Descriptor</emphasis>, which identifies a remotely deployed CAS Processor
+ service.</para></listitem>
+
+ <listitem><para>For <emphasis>local</emphasis> CAS Processors, the
+ referenced descriptor must be a Vinci <emphasis>Service Deployment
+ Descriptor</emphasis>.</para></listitem>
+
+ <listitem><para>For <emphasis>integrated</emphasis> CAS Processors,
+ the referenced descriptor must be an Analysis Engine Descriptor
+ (primitive or aggregate). </para></listitem></itemizedlist> </para>
+
+ <para>See <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.application.remote_services"/> for more
+ information on creating these descriptors and deploying services.</para>
+
+ </section>
+
+ <section
+ id="&tp;descriptor.cas_processors.individual.configuration_parameter_settings">
+ <title><configurationParameterSettings> Element</title>
+
+ <para>This element provides a way to override the contained Analysis
+ Engine's parameter settings. Any entry specified here must already be
+ defined; values specified replace the corresponding values for each
+ parameter. <emphasis role="bold-italic">For CAS Processors, this mechanism
+ is only available when they are deployed in <quote>integrated</quote>
+ mode.</emphasis> For Collection Readers and CAS Initializers, it is always
+ available.</para>
+
+ <para>The content of this element is identical to the component descriptor for
+ specifying parameters (in the case where no parameter groups are
+ specified)<footnote><para>An earlier UIMA version required these to have a
+ suffix of <quote>_p</quote>, e.g., <quote>string_p</quote>. This is no
+ longer required, but this format is accepted, also, for backward
+ compatibility.</para></footnote>. Here is an example:
+
+
+ <programlisting><![CDATA[<configurationParameterSettings>
+ <nameValuePair>
+ <name>CivilianTitles</name>
+ <value>
+ <array>
+ <string>Mr.</string>
+ <string>Ms.</string>
+ <string>Mrs.</string>
+ <string>Dr.</string>
+ </array>
+ </value>
+ </nameValuePair>
+ ...
+</configurationParameterSettings>]]></programlisting></para>
+
+ </section>
+
+ <section
+ id="&tp;descriptor.cas_processors.individual.sofa_name_mappings">
+ <title><sofaNameMappings> Element</title>
+
+ <para>This optional element maps Sofa names defined in the
+ component, or the default Sofa name (if the component does not declare any Sofa
+ names), to Sofa names in the CPE. The form of this element is:
+
+
+ <programlisting><sofaNameMappings>
+ <sofaNameMapping cpeSofaName="a_CPE_name"
+ componentSofaName="a_component_Name"/>
+ ...
+</sofaNameMappings></programlisting></para>
+
+ <para>There can be any number of
+ <literal><sofaNameMapping></literal> elements contained in the
+ <literal><sofaNameMappings></literal> element. The
+ <literal>componentSofaName</literal> attribute is optional; leave it out to
+ specify a mapping for the <literal>_InitialView</literal> - that is, for
+ Single-View components.</para>
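+
+ <para>For example, assuming a CPE-level Sofa named <literal>EnglishDocument</literal>
+ (a hypothetical name), a Single-View component can be pointed at it by omitting
+ <literal>componentSofaName</literal>, while a component that declares a Sofa
+ named <literal>SourceText</literal> (also hypothetical) names it explicitly:
+
+ <programlisting><![CDATA[<!-- Single-View component: maps its _InitialView -->
+<sofaNameMappings>
+  <sofaNameMapping cpeSofaName="EnglishDocument"/>
+</sofaNameMappings>
+
+<!-- Component that declares a Sofa named SourceText -->
+<sofaNameMappings>
+  <sofaNameMapping cpeSofaName="EnglishDocument"
+                   componentSofaName="SourceText"/>
+</sofaNameMappings>]]></programlisting></para>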
+
+ </section>
+
+ <section id="&tp;descriptor.cas_processors.run_in_separate_process">
+ <title><runInSeparateProcess> Element</title>
+
+ <para>The <literal><runInSeparateProcess></literal> element is
+ mandatory for <literal>local</literal> CAS Processors, but should not appear
+ for <literal>remote</literal> or <literal>integrated</literal> CAS
+ Processors. It enables the CPM to create external processes using the provided
+ runtime environment. Applications launched this way communicate with the CPM
+ using the Vinci protocol and connectivity is enabled by a local instance of the
+ VNS that the CPM manages. Since communication is based on Vinci, the application
+ need not be implemented in Java. Any language for which Vinci provides support
+ may be used to create an application, and the CPM will seamlessly communicate
+ with it. The overall structure of this element is:
+
+
+ <programlisting><![CDATA[<runInSeparateProcess>
+ <exec dir="[String]" executable="[String]">
+ <env key="[String]" value="[String]"/>
+ ...
+ <arg>[String]</arg>
+ ...
+ </exec>
+</runInSeparateProcess>]]></programlisting></para>
+
+ <para>The <literal><exec></literal> element provides information
+ about how to execute the referenced CAS Processor. Two attributes are defined
+ for the <literal><exec></literal> element. The
+ <literal>dir</literal> attribute is currently not used – it is reserved
+ for future functionality. The <literal>executable</literal> attribute
+ specifies the actual Vinci service executable that will be run by the CPM, e.g.,
+ <literal>java</literal>, a batch script, an application (.exe), etc. The
+ executable must be specified with a fully qualified path, or be found in the
+ <literal>PATH</literal> of the CPM.</para>
+
+ <para>The <literal><exec></literal> element has two elements within it
+ that define parameters used to construct the command line for executing the CAS
+ Processor. These elements must be listed in the order in which they should be
+ defined for the CAS Processor.</para>
+
+ <para>The optional <literal><env></literal> element is used to set an
+ environment variable. The variable <literal>key</literal> will be set to
+ <literal>value</literal>. For example,
+
+
+ <programlisting><env key="CLASSPATH" value="C:\Java\lib"/></programlisting>
+ will set the environment variable <literal>CLASSPATH</literal> to the value
+ <literal>C:\Java\lib</literal>. The <literal><env></literal>
+ element may be repeated to set multiple environment variables. All of the
+ key/value pairs will be added to the environment by the CPM prior to launching the
+ executable.</para>
+ <note><para>The CPM actually adds ALL system environment variables when it
+ launches the program. It queries the Operating System for its current system
+ variables and one by one adds them to the program's process
+ configuration.</para></note>
+
+ <para>The <literal><arg></literal> element is used to specify arbitrary
+ string arguments that will appear on the command line when the CPM runs the
+ command specified in the <literal>executable</literal> attribute.</para>
+
+ <para>For example, the following would be used to invoke the UIMA Java
+ implementation of the Vinci service wrapper on a Java CAS Processor:
+
+
+ <programlisting><![CDATA[<runInSeparateProcess>
+ <exec executable="java">
+ <arg>-DVNS_HOST=localhost</arg>
+ <arg>-DVNS_PORT=9099</arg>
+ <arg>org.apache.uima.reference_impl.analysis_engine.service.
+vinci.VinciAnalysisEngineService_impl</arg>
+ <arg>C:\uima\desc\deployCasProcessor.xml</arg>
+ </exec>
+</runInSeparateProcess>]]></programlisting></para>
+
+ <para>This will cause the CPM to run the following command line when starting the
+ CAS Processor:
+
+
+ <programlisting>java -DVNS_HOST=localhost -DVNS_PORT=9099
+ org.apache.uima.reference_impl.analysis_engine.service.vinci.\\
+ VinciAnalysisEngineService_impl
+ C:\uima\desc\deployCasProcessor.xml</programlisting></para>
+
+ <para>The first argument specifies that the Vinci Naming Service is running on the
+ <literal>localhost</literal>. The second argument specifies that the Vinci
+ Naming Service port number is <literal>9099</literal>. The third argument
+ (split over 2 lines in this documentation)
+ identifies the UIMA implementation of the Vinci service wrapper. This class
+ contains the <literal>main</literal> method that will execute. That main
+ method in turn takes a single argument – the filename for the CAS Processor
+ service deployment descriptor. Thus the last argument identifies the Vinci
+ service deployment descriptor file for the CAS Processor. Since this is the same
+ descriptor file specified earlier in the
+ <literal><descriptor></literal> element, the string
+ <literal>${descriptor}</literal> can be used to refer to the descriptor,
+ e.g.:
+
+
+ <programlisting><arg>${descriptor}</arg></programlisting></para>
+
+ <para>The CPM will expand this out to the service deployment descriptor file
+ referenced in the <literal><descriptor></literal> element.</para>
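+
+ <para>Putting this together, the earlier example could be written with the
+ <literal>${descriptor}</literal> substitution as sketched here. It uses the same
+ host, port, and service wrapper class as that example, with the class name
+ shown on a single line:
+
+ <programlisting><![CDATA[<runInSeparateProcess>
+  <exec executable="java">
+    <arg>-DVNS_HOST=localhost</arg>
+    <arg>-DVNS_PORT=9099</arg>
+    <arg>org.apache.uima.reference_impl.analysis_engine.service.vinci.VinciAnalysisEngineService_impl</arg>
+    <!-- expanded by the CPM to the file named in <descriptor> -->
+    <arg>${descriptor}</arg>
+  </exec>
+</runInSeparateProcess>]]></programlisting></para>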
+
+ </section>
+
+ <section
+ id="&tp;descriptor.cas_processors.individual.deployment_parameters">
+ <title><deploymentParameters> Element</title>
+
+ <para>The <literal><deploymentParameters></literal> element defines
+ a number of deployment parameters that control how the CPM will interact with the
+ CAS Processor. This element has the following overall form:
+
+
+ <programlisting><deploymentParameters>
+ <parameter name="[String]" value="..." type="string|integer" />
+ ...
+</deploymentParameters></programlisting></para>
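+
+ <para>As an illustration, the <literal>service-access</literal> parameter described
+ in the list below would be set as follows; this is a sketch of a single parameter,
+ not a complete recommended configuration:
+
+ <programlisting><![CDATA[<deploymentParameters>
+  <parameter name="service-access" value="exclusive" type="string"/>
+</deploymentParameters>]]></programlisting></para>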
+
+ <para>The <literal>name</literal> attribute identifies the parameter, the
+ <literal>value</literal> attribute specifies the value that will be assigned
+ to the parameter, and the <literal>type</literal> attribute indicates the
+ type of the parameter, either <literal>string</literal> or
+ <literal>integer</literal>. The available parameters include:
+
+ <variablelist>
+
+ <varlistentry>
+ <term>service-access</term>
+ <listitem><para>string parameter whose value must be
+ <quote>exclusive</quote>, if present. This parameter is only
+ effective for remote deployments. It modifies the Vinci service
+ connections to be preallocated and dedicated, one service instance per
+ pipeline. It is only relevant for non-integrated deployment modes. If
+ there are fewer service instances available (and alive –
+ responding to a <quote>ping</quote> request) than there are pipelines,
+ the number of pipelines (the number of concurrent threads) is reduced to
+ match the number of available instances. If not specified, the VNS is
+ queried each time a service is needed, and a <quote>random</quote>
+ instance is assigned from the pool of available instances. If a service
+ dies during processing, the CPM will use its normal error handling
+ procedures to attempt to reconnect. The number of attempts is specified
+ in the CPE descriptor for each CAS Processor using the
+ <literal><maxConsecutiveRestarts value="10"
+ action="kill-pipeline"
+ waitTimeBetweenRetries="50"/></literal> XML element. The
+ <quote>value</quote> attribute is the number of reconnection tries;
+ the <quote>action</quote> says what to do if the retries exceed the
+ limit. The <quote>kill-pipeline</quote> action stops the pipeline
+ that was associated with the failing service (other pipelines will
+ continue to work). The CAS in process within a killed pipeline will be
+ dropped. These events are communicated to the application using the
+ normal event listener mechanism. The
+ <literal>waitTimeBetweenRetries</literal> attribute says how many
+ milliseconds to wait between attempts to reconnect.</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>vnsHost</term>
+ <listitem><para>(Deprecated) string parameter specifying the VNS host,
+ e.g., <literal>localhost</literal> for local CAS Processors, host
+ name or IP address of VNS host for remote CAS Processors. This parameter is
+ deprecated; use the parameter specification instead inside the Vinci
+ <emphasis>Service Client Descriptor</emphasis>, if needed. It is
+ ignored for integrated and local deployments. If present, for remote
+ deployments, it specifies the VNS Host to use, unless that is specified in
+ the Vinci <emphasis>Service Client Descriptor</emphasis>.</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>vnsPort</term>
+ <listitem><para>(Deprecated) integer parameter specifying the VNS port
+ number. This parameter is deprecated; use the parameter specification
+ instead inside the Vinci <emphasis>Service Client
+ Descriptor,</emphasis> if needed. It is ignored for integrated and
+ local deployments. If present, for remote deployments, it specifies the
+ VNS Port number to use, unless that is specified in the Vinci
+ <emphasis>Service Client Descriptor.</emphasis></para>
+ </listitem>
+ </varlistentry>
+ </variablelist></para>
+
+ <para>For example, the following parameters might be used with a CAS Processor
+ deployed in local mode:
+
+
+ <programlisting><deploymentParameters>
+ <parameter name="service-access" value="exclusive" type="string"/>
+</deploymentParameters></programlisting></para>
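+
+ <para>As an illustrative sketch only, the following shows how the
+ <literal>service-access</literal> parameter and the
+ <literal>kill-pipeline</literal> restart policy described above might be
+ combined for a remote CAS Processor. The processor name and the descriptor
+ file name are hypothetical, and the numeric values are placeholders rather
+ than recommendations:
+
+
+ <programlisting><![CDATA[<casProcessor deployment="remote" name="MyRemoteAnnotator">
+  <descriptor>
+    <!-- hypothetical Vinci Service Client Descriptor -->
+    <import location="MyRemoteAnnotatorServiceClient.xml"/>
+  </descriptor>
+  <deploymentParameters>
+    <!-- dedicate one service instance to each pipeline -->
+    <parameter name="service-access" value="exclusive" type="string"/>
+  </deploymentParameters>
+  <filter/>
+  <errorHandling>
+    <!-- try to reconnect up to 10 times, 50 ms apart, then stop this pipeline -->
+    <maxConsecutiveRestarts value="10" action="kill-pipeline"
+        waitTimeBetweenRetries="50"/>
+    <errorRateThreshold action="terminate" value="100/1000"/>
+    <timeout max="100000"/>
+  </errorHandling>
+</casProcessor>]]></programlisting></para>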
+
+ </section>
+
+ <section id="&tp;descriptor.cas_processors.individual.filter">
+ <title><filter> Element</title>
+
+ <para>The <filter> element is a required element but currently should be
+ left empty. This element is reserved for future use.</para>
+
+ </section>
+
+ <section id="&tp;descriptor.cas_processors.individual.error_handling">
+ <title><errorHandling> Element</title>
+
+ <para>The mandatory <literal><errorHandling></literal> element
+ defines error and restart policies for the CAS Processor. Each CAS Processor may
+ define different actions in the event of errors and restarts. The CPM monitors
+ and logs errant behaviors and attempts to recover the component based on the
+ policies specified in this element.</para>
+
+ <para>There are two kinds of faults:
+
+ <orderedlist><listitem><para>One kind only occurs with non-integrated CAS
+ Processors – this fault is either a timeout attempting to launch or
+ connect to the non-integrated component, or some other kind of connection
+ related exception (for instance, the network connection might timeout or get
+ reset).</para></listitem>
+
+ <listitem><para>The other kind happens when the CAS Processor component (an
+ Annotator, for example) throws any kind of exception. This kind may occur
+ with any kind of deployment, integrated or not. </para></listitem>
+ </orderedlist></para>
+
+ <para>The <literal><errorHandling></literal> element has specifications for
+ each of these kinds of faults. The format of this element is:
+
+
+ <programlisting><![CDATA[<errorHandling>
+ <maxConsecutiveRestarts action="continue|disable|terminate"
+ value="[Number]"/>
+ <errorRateThreshold action="continue|disable|terminate" value="[Rate]"/>
+ <timeout max="[Number]"/>
+</errorHandling>]]></programlisting></para>
+
+ <para>The mandatory <literal><maxConsecutiveRestarts></literal>
+ element applies only to faults of the first kind, and therefore, only applies to
+ non-integrated deployments. If such a fault occurs, a retry is attempted, up to
+ <literal>value="[Number]"</literal> times. This retry resets the
+ connection (if one was made) and attempts to reconnect and perhaps re-launch
+ (see below for details). The original CAS (not a partially updated one) is sent to
+ the CAS Processor as part of the retry, once the deployed component has been
+ successfully restarted or reconnected to.</para>
+
+ <para>The <literal>action</literal> attribute specifies the action to take
+ when the threshold specified by the <literal>value="[Number]"</literal> is
+ exceeded. The possible actions are:
+
+ <variablelist>
+ <varlistentry>
+ <term>continue</term>
+ <listitem><para>skip any further processing for this CAS by this CAS
+ Processor, and pass the CAS to the next CAS Processor in the Pipeline.
+ </para>
+ <para>The <quote>restart</quote> action is done, because it is needed
+ for the next CAS.</para>
+
+ <para>If <literal>dropCasOnException="true"</literal> is set, the CPM
+ will NOT pass the CAS to the next CAS Processor in the chain. Instead, the
+ CPM will abort processing of this CAS, release the CAS back to the CAS
+ Pool, and process the next CAS in the queue.</para>
+
+ <para>The counter counting the restarts toward the threshold is only
+ reset after a CAS is successfully processed.</para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>disable</term>
+ <listitem><para>the current CAS is handled just as in the
+ <literal>continue</literal> case, but in addition, the CAS Processor
+ is marked so that its <emphasis>process()</emphasis> method will not be
+ called again (i.e., it will be <quote>skipped</quote> for future
+ CASes)</para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>terminate</term>
+ <listitem><para>the CPM will terminate all processing and exit.</para>
+ </listitem>
+ </varlistentry>
+ </variablelist></para>
+
+ <para>The definition of an error for the
+ <literal><maxConsecutiveRestarts></literal> element differs
+ slightly for each of the three CAS Processor deployment modes:
+ <variablelist>
+ <varlistentry>
+ <term>local</term>
+ <listitem><para>Local CAS Processors experience two general error
+ types:
+ <itemizedlist>
+ <listitem><para>launch errors – errors associated with
+ launching a process</para></listitem>
+ <listitem><para>processing errors – errors associated with
+ sending Vinci commands to the process</para></listitem>
+ </itemizedlist></para>
+
+ <para>A launch error is defined by a failure of the process to
+ successfully register with the local VNS within a default time window.
+ The current timeout is 15 minutes. Multiple local CAS Processors are
+ launched sequentially, with a subsequent processor launched
+ immediately after its previous processor successfully registers
+ with the VNS.</para>
+
+ <para>A processing error is detected if a connection to the CAS Processor
+ is lost or if the processing time exceeds a specified timeout
+ value.</para>
+
+ <para>For local CAS Processors, the
+ <maxConsecutiveRestarts> element specifies the number of
+ consecutive attempts made to launch the CAS Processor at CPM startup or
+ after the CPM has lost a connection to the CAS Processor.</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>remote</term>
+ <listitem><para>For remote CAS Processors, the
+ <maxConsecutiveRestarts> element applies to errors from
+ sending Vinci commands. An error is detected if a connection to the CAS
+ Processor is lost, or if the processing time exceeds the timeout value
+ specified in the <timeout> element (see below).</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>integrated</term>
+ <listitem><para>Although mandatory, the
+ <maxConsecutiveRestarts> element is NOT used for integrated CAS
+ Processors, because integrated CAS Processors are not
+ re-instantiated or restarted on exceptions. The CPM ignores this setting
+ for integrated CAS Processors, but the element must still be present.
+ Future versions of the CPM will make this element mandatory for remote
+ and local CAS Processors only.</para></listitem>
+ </varlistentry>
+
+ </variablelist></para>
+
+ <para>The mandatory <literal><errorRateThreshold></literal> element
+ is used for all faults – both those above, and exceptions thrown by the CAS
+ Processor itself. It specifies the number of retries for exceptions thrown by
+ the CAS Processor itself, a maximum error rate, and the corresponding action to
+ take when this rate is exceeded. The <literal>value</literal> attribute
+ specifies the error rate in terms of errors per sample size in the form
+ <quote><literal>N/M</literal></quote>, where <literal>N</literal> is the
+ number of errors and <literal>M</literal> is the sample size, defined in terms
+ of the number of documents.</para>
+
+ <para>The first number is also used to indicate the maximum number of retries. If
+ this number is less than the <literal><maxConsecutiveRestarts
+ value="[Number]"></literal> value, it overrides that value, reducing the number of
+ <quote>restarts</quote> attempted. A retry is done only if
+ <literal>dropCasOnException</literal> is false. If it is set to true, no retry
+ occurs, but the error is counted.</para>
+
+ <para>When the number of counted errors within the current sample exceeds the
+ number of errors given in the threshold (<literal>N</literal>), the action
+ specified by the <literal>action</literal> attribute is taken. The possible
+ actions and their meaning are the same as described above for the
+ <literal><maxConsecutiveRestarts></literal> element:
+ <itemizedlist spacing="compact">
+ <listitem><para><literal>continue</literal></para></listitem>
+ <listitem><para><literal>disable</literal></para></listitem>
+ <listitem><para><literal>terminate</literal></para></listitem>
+ </itemizedlist></para>
+
+ <para>The <literal>dropCasOnException="true"</literal> attribute of the
+ <literal><casProcessors></literal> element modifies the action
+ taken for continue and disable, in the same manner as above. For example:
+
+
+ <programlisting><errorRateThreshold value="3/1000" action="disable"/></programlisting>
+ specifies that each error thrown by the CAS Processor itself will be retried up to
+ 3 times (if <literal>dropCasOnException</literal> is false) and the CAS
+ Processor will be disabled if the error rate exceeds 3 errors in 1000
+ documents.</para>
+
+ <para>If a document causes an error and the error rate threshold for the CAS
+ Processor is not exceeded, the CPM increments the CAS Processor's error
+ count and retries processing that document (if
+ <literal>dropCasOnException</literal> is false). The retry means that the
+ CPM calls the CAS Processor's process() method again, passing in as an
+ argument the same CAS that previously caused an exception.</para>
+ <note><para>The CPM does not attempt to rollback any partial changes that may have
+ been applied to the CAS in the previous process() call. </para></note>
+
+ <para>Errors are accumulated across documents. For example, assume the error
+ rate threshold is <literal>3/1000</literal>. The same document may fail three
+ times before finally succeeding on the fourth try, but the error count is now 3. If
+ one more error occurs within the current sample of 1000 documents, the error rate
+ threshold will be exceeded and the specified action will be taken. If no more
+ errors occur within the current sample, the error counter is reset to 0 for the
+ next sample of 1000 documents.</para>
+
+ <para>The <literal><timeout></literal> element is a mandatory element.
+ Although mandatory for all CAS Processors, this element is only relevant for
+ local and remote CAS Processors. For integrated CAS Processors, this element is
+ ignored. In the current CPM implementation the integrated CAS Processor
+ process() method is not subject to timeouts.</para>
+
+ <para>The <literal>max</literal> attribute specifies the maximum amount of
+ time in milliseconds the CPM will wait for a process() method to complete. When
+ this time is exceeded, the CPM will generate an exception and will treat this as an error
+ subject to the threshold defined in the
+ <literal><errorRateThreshold></literal> element above, including
+ doing retries.</para>
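+
+ <para>The interplay of the three sub-elements described above can be sketched as
+ follows. This is only an illustration of the rules stated in this section; the
+ values are placeholders chosen to show the case where the first number of the
+ error rate is smaller than the restart count:
+
+
+ <programlisting><![CDATA[<errorHandling>
+  <!-- connection or launch faults: up to 10 consecutive restarts requested -->
+  <maxConsecutiveRestarts action="terminate" value="10"/>
+  <!-- 3 errors per 1000 documents; because 3 is less than 10, it also
+       reduces the number of restarts attempted to 3 -->
+  <errorRateThreshold action="disable" value="3/1000"/>
+  <!-- wait at most 60000 milliseconds for process() to complete;
+       ignored for integrated CAS Processors -->
+  <timeout max="60000"/>
+</errorHandling>]]></programlisting></para>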
+
+ <section
+ id="&tp;descriptor.cas_processors.individual.error_handling.timeout_retry_action">
+ <title>Retry action taken on a timeout</title>
+
+ <para>The action taken depends on whether the CAS Processor is local (managed)
+ or remote (unmanaged). Local CAS Processors (which are services) are killed
+ and restarted, and a new connection to them is established. For remote CAS
+ Processors, the connection to them is dropped, and a new connection is
+ reestablished (which may actually connect to a different instance of the
+ remote services, if it has multiple instances).</para>
+ </section>
+ </section>
+
+ <section id="&tp;descriptor.cas_processors.individual.checkpoint">
+ <title><checkpoint> Element</title>
+
+ <para>The <literal><checkpoint></literal> element is an optional
+ element used to improve the performance of CAS Consumers. It has a single
+ attribute, <literal>batch</literal>, which specifies the number of CASes in a
+ batch, e.g.:
+
+
+ <programlisting><checkpoint batch="1000"/></programlisting></para>
+
+ <para>sets the batch size to 1000 CASes. The batch size is the interval used to mark a
+ point in processing requiring special handling. The CAS Processor's
+ <literal>batchProcessComplete()</literal> method will be called by the CPM
+ when this mark is reached so that the processor can take appropriate action. This
+ mark could be used as a mechanism to buffer up results in CAS Consumers and perform
+ time-consuming operations, such as check-pointing, that should not be done on a
+ per-document basis.</para>
+
+ </section>
+ </section>
+ </section>
+
+ <section id="&tp;descriptor.operational_parameters">
+ <title>CPE Operational Parameters</title>
+
+ <para>The parameters for configuring the overall CPE and CPM are specified in the
+ <literal><cpeConfig></literal> section. The overall format of this
+ section is:
+
+
+ <programlisting><![CDATA[<cpeConfig>
+ <startAt>[NumberOrID]</startAt>
+
+ <numToProcess>[Number]</numToProcess>
+
+ <outputQueue dequeueTimeout="[Number]" queueClass="[ClassName]" />
+
+ <checkpoint file="[File]" time="[Number]" batch="[Number]"/>
+
+ <timerImpl>[ClassName]</timerImpl>
+
+ <deployAs>vinciService|interactive|immediate|single-threaded
+ </deployAs>
+
+</cpeConfig>]]></programlisting></para>
+
+ <para>This section of the CPE descriptor allows for defining the starting entity, the
+ number of entities to process, a checkpoint file and frequency, a pluggable timer, an
+ optional output queue implementation, and finally a mode of operation. The mode of
+ operation determines how the CPM interacts with users and other systems.</para>
+
+ <para>The <literal><startAt></literal> element is an optional element. It
+ defines the starting entity in the collection at which the CPM should start
+ processing.</para>
+
+ <para>The implementation in the CPM passes this argument to the Collection Reader
+ as the value of the parameter <quote><literal>startNumber</literal></quote>.
+ The CPM does not do anything else with this parameter; in particular, the CPM has no
+ ability to skip to a specific document - that function, if available, is only provided
+ by a particular Collection Reader implementation.</para>
+
+ <para>If the <literal><startAt></literal> element is used, the Collection
+ Reader descriptor must define a single-valued configuration parameter with the
+ name <literal>startNumber</literal>. It can declare this value to be of any type;
+ the value passed in this XML element must be convertible to that type.</para>
+
+ <para>A typical use is to declare this to be an integer type, and to pass the sequential
+ document number where processing should start. An alternative implementation
+ might take a specific document ID; the collection reader could search through its
+ collection until it reaches this ID and then start there.</para>
+
+ <para>This parameter will only make sense if the particular collection reader is
+ implemented to use the <literal>startNumber</literal> configuration
+ parameter.</para>
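+
+ <para>A minimal sketch of the two pieces that must agree, assuming a Collection
+ Reader that interprets <literal>startNumber</literal> as a sequential document
+ number (the value 100 and the description text are illustrative only). In the
+ CPE descriptor:
+
+
+ <programlisting><![CDATA[<cpeConfig>
+  <startAt>100</startAt>
+  ...
+</cpeConfig>]]></programlisting>
+
+ and, in the Collection Reader descriptor, the matching single-valued parameter
+ declaration:
+
+
+ <programlisting><![CDATA[<configurationParameters>
+  <configurationParameter>
+    <name>startNumber</name>
+    <description>Sequential number of the first document to process</description>
+    <type>Integer</type>
+    <multiValued>false</multiValued>
+    <mandatory>false</mandatory>
+  </configurationParameter>
+</configurationParameters>]]></programlisting></para>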
+
+ <para>The <literal><numToProcess></literal> element is an optional
+ element. It specifies the total number of entities to process. Use -1 to indicate ALL.
+ If not defined, the number of entities to process will be taken from the Collection
+ Reader configuration. If present, this value overrides the Collection Reader
+ configuration.</para>
+
+ <para>The <literal><outputQueue></literal> element is an optional element.
+ It enables plugging in a custom implementation for the Output Queue. When omitted,
+ the CPM will use a default output queue that is based on a First-In, First-Out (FIFO)
+ model.</para>
+
+ <para>The UIMA SDK provides a second implementation for the Output Queue that can be
+ plugged in to the CPM, named <quote>
+ <literal>org.apache.uima.collection.impl.cpm.engine.SequencedQueue</literal>
+ </quote>.</para>
+
+ <para>This implementation supports handling very large documents that are split into
+ <quote>chunks</quote>; it provides a delivery mechanism that ensures the
+ sequential order of the chunks using information carried in the CAS metadata. This
+ metadata, which is required for this implementation to work correctly, must be added
+ as an instance of a Feature Structure of type
+ <literal>org.apache.es.tt.DocumentMetaData</literal> and referred to by an
+ additional feature named <literal>esDocumentMetaData</literal> in the special
+ instance of <literal>uima.tcas.DocumentAnnotation</literal> that is
+ associated with the CAS. This is usually done by the Collection Reader; the instance
+ contains the following features:
+
+ <variablelist>
+ <varlistentry>
+ <term>sequenceNumber</term>
+ <listitem><para>[Number] the sequential number of a chunk, starting at 1. If
+ not a chunk (i.e. complete document), the value should be 0.</para>
+ </listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>documentId</term>
+ <listitem><para>[Number] current document id. Chunks belonging to the same
+ document have identical document id.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>isCompleted</term>
+ <listitem><para>[Number] 1 if the chunk is the last in a sequence, 0
+ otherwise.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>url</term>
+ <listitem><para>[String] document url.</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>throttleID</term>
+ <listitem><para>[String] special attribute currently used by
+ OmniFind.</para></listitem>
+ </varlistentry>
+ </variablelist></para>
+
+ <para>This implementation of a sequenced queue supports proper sequencing of CASes in
+ CPM deployments that use document chunking. Chunking is a technique of splitting
+ large documents into pieces to reduce overall memory consumption. Chunking does not
+ depend on the number of CASes in the CAS Pool. It works equally well with one or more
+ CASes in the CAS Pool. Each chunk is packaged in a separate CAS and placed in the Work
+ Queue. If the CAS Pool is depleted, the CollectionReader thread is suspended until a
+ CAS is released back to the pool by the processing threads. A document may be split into
+ 1, 2, 3 or more chunks that are analyzed independently. In order to reconstruct the
+ document correctly, the CAS Consumer can depend on receiving the chunks in the same
+ sequential order that the chunks were <quote>produced</quote>, when this
+ sequenced queue implementation is used. To plug in this sequenced queue to the CPM use
+ the following specification:
+
+
+ <programlisting><outputQueue dequeueTimeout="100000" queueClass=
+"org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/></programlisting>
+
+ where the mandatory <literal>queueClass</literal> attribute defines the name of
+ the class and the second mandatory attribute, <literal>dequeueTimeout</literal>,
+ specifies the maximum number of milliseconds to wait for the expected chunk.</para>
+
+ <note><para>The value for this timeout must be carefully determined to avoid
+ excessive occurrences of timeouts. Typically, the size of a chunk and the type of
+ analysis being done are the most important factors when deciding on the value for the
+ timeout. The larger the chunk and the more complicated the analysis, the more time it takes
+ for the chunk to go from source to sink. You may specify 0, in which case the timeout is
+ disabled - i.e., it is equivalent to an infinitely long timeout.</para></note>
+
+ <para>If the chunk doesn't arrive in the configured time window, the entire
+ document is presumed to be invalid and the CAS is dropped from further processing.
+ This action occurs regardless of any other error action specification. The
+ SequencedQueue invalidates the document, adding the offending document's
+ metadata to a local cache of invalid documents.</para>
+
+ <para>If the timeout occurs, the CPM notifies all registered listeners (see <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.cpe.using_listeners"/>) by calling
+ entityProcessComplete(). As part of this call, the SequencedQueue passes null
+ instead of a CAS as the first argument, together with a special exception,
+ CPMChunkTimeoutException. Null is passed because the timeout means that the
+ expected chunk was not received within the configured timeout window, so there
+ is no CAS available when the timeout event occurs.</para>
+
+ <para>The CPMChunkTimeoutException object includes an API that allows the listener
+ to retrieve the offending document id as well as the other metadata attributes as
+ defined above. These attributes are part of each chunk's metadata and are added
+ by the Collection Reader.</para>
+
+ <para>Each chunk that the SequencedQueue works on is tested to determine whether it
+ belongs to an invalid document. This test checks the chunk's metadata
+ against the data in the local cache. If there is a match, the chunk is dropped. This
+ check is performed only for chunks; complete documents are not subject to
+ it.</para>
+
+ <para>If there is an exception during the processing of a chunk, the CPM sends a
+ notification to all registered listeners. The notification includes the CAS and an
+ exception. When the listener notification is completed, the CPM also sends separate
+ notifications, containing the CAS, to the Artifact Producer and the
+ SequencedQueue. The intent is to stop adding new chunks to the Work Queue that belong
+ to an <quote>invalid</quote> document and also to deal with chunks that are
+ en-route, being processed by the processing threads.</para>
+
+ <para>In response to the notification, the Artifact Producer will drop and release
+ back to the CAS Pool all CASes that belong to an <quote>invalid</quote> document.
+ Currently, there is no support in the CollectionReader's API to tell it to stop
+ generating chunks. The CollectionReader keeps producing the chunks but the
+ Artifact Producer immediately drops/releases them to the CAS Pool. Before the CAS is
+ released back to the CAS Pool, the Artifact Producer sends notification to all
+ registered listeners. This notification includes the CAS and an exception –
+ SkipCasException.</para>
+
+ <para>In response to the notification of an exception involving a chunk, the
+ SequencedQueue retrieves from the CAS the metadata and adds it to its local cache of
+ <quote>invalid</quote> documents. All chunks de-queued from the OutputQueue and
+ belonging to <quote>invalid</quote> documents will be dropped and released back to
+ the CAS Pool. Before dropping the CAS, the CPM sends notification to all registered
+ listeners. The notification includes the CAS and SkipCasException.</para>
+
+ <para>The <literal><checkpoint></literal> element is an optional element.
+ It specifies a CPE checkpoint file, checkpoint frequency, and strategy for
+ checkpoints (time or count based). At checkpoint time, the CPM saves status
+ information and statistics to the checkpoint file. The checkpoint file is specified
+ in the <literal>file</literal> attribute, which has the same form as the
+ <literal>href</literal> attribute of the <literal><include></literal>
+ element described in <xref linkend="&tp;imports"/>. The
+ <literal>time</literal> attribute indicates that a checkpoint should be taken
+ every <literal>[Number]</literal> seconds, and the <literal>batch</literal>
+ attribute indicates that a checkpoint should be taken every
+ <literal>[Number]</literal> batches.</para>
+
+ <para>The <literal><timerImpl></literal> element is optional. It is used to
+ identify a custom timer plug-in class to generate time stamps during the CPM
+ execution. The value of the element is a Java class name.</para>
+
+ <para>The <literal><deployAs></literal> element indicates the type of CPM
+ deployment. Valid contents for this element include:
+
+ <variablelist>
+ <varlistentry>
+ <term>vinciService</term>
+ <listitem><para>Vinci service exposing APIs for stop, pause, resume, and
+ getStats</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>interactive</term>
+ <listitem><para>provide command line menus (start, stop, pause,
+ resume)</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>immediate</term>
+ <listitem><para>run the CPM without menus or a service API</para></listitem>
+ </varlistentry>
+ <varlistentry>
+ <term>single-threaded</term>
+ <listitem><para>run the CPM in a single threaded mode. In this mode, the
+ Collection Reader, the Processing Pipeline, and the CAS Consumer Pipeline
+ are all running in one thread without the work queue and the output
+ queue.</para></listitem>
+ </varlistentry>
+ </variablelist></para>
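+
+ <para>Putting the elements described in this section together, a
+ <literal><cpeConfig></literal> section might look like the following
+ sketch. The checkpoint file name is hypothetical and all numeric values are
+ placeholders; the optional <literal><timerImpl></literal> element is
+ simply omitted here:
+
+
+ <programlisting><![CDATA[<cpeConfig>
+  <!-- passed to the Collection Reader as its startNumber parameter;
+       its interpretation depends on the Collection Reader -->
+  <startAt>100</startAt>
+  <!-- -1 means process all entities -->
+  <numToProcess>-1</numToProcess>
+  <!-- optional; omit to use the default FIFO output queue -->
+  <outputQueue dequeueTimeout="100000" queueClass=
+"org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/>
+  <!-- hypothetical file name; time-based checkpoint strategy -->
+  <checkpoint file="cpe_checkpoint.dat" time="300"/>
+  <!-- run without menus or a service API -->
+  <deployAs>immediate</deployAs>
+</cpeConfig>]]></programlisting></para>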
+
+ </section>
+
+ <section id="&tp;descriptor.resource_manager_configuration">
+ <title>Resource Manager Configuration</title>
+
+ <para>External resource bindings for the CPE may optionally be specified in an
+ element:
+
+
+ <programlisting><resourceManagerConfiguration href="..."/></programlisting></para>
+
+ <para>For an introduction to external resources, refer to <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.aae.accessing_external_resource_files"/>.</para>
+
+ <para>In the <literal>resourceManagerConfiguration</literal> element, the value
+ of the href attribute refers to another file that contains definitions and bindings
+ for the external resources used by the CPE. The format of this file is the same as the XML
+ snippet described in <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings"/>.
+ For example, in a CPE containing an aggregate analysis engine with two annotators,
+ and a CAS Consumer, the following resource manager configuration file would bind
+ external resource dependencies in all three components to the same physical
+ resource:
+
+
+ <programlisting><![CDATA[<resourceManagerConfiguration>
+
+ <!-- Declare Resource -->
+
+ <externalResources>
+ <externalResource>
+ <name>ExampleResource</name>
+ <fileResourceSpecifier>
+ <fileUrl>file:MyResourceFile.dat</fileUrl>
+ </fileResourceSpecifier>
+ </externalResource>
+ </externalResources>
+
+ <!-- Bind component resource dependencies to ExampleResource -->
+
+ <externalResourceBindings>
+ <externalResourceBinding>
+ <key>MyAE/annotator1/myResourceKey</key>
+ <resourceName>ExampleResource</resourceName>
+ </externalResourceBinding>
+
+ <externalResourceBinding>
+ <key>MyAE/annotator2/someResourceKey</key>
+ <resourceName>ExampleResource</resourceName>
+ </externalResourceBinding>
+
+ <externalResourceBinding>
+ <key>MyCasConsumer/otherResourceKey</key>
+ <resourceName>ExampleResource</resourceName>
+ </externalResourceBinding>
+
+ </externalResourceBindings>
+
+</resourceManagerConfiguration>]]></programlisting></para>
+
+ <para>In this example, <literal>MyAE</literal> and
+ <literal>MyCasConsumer</literal> are the names of the Analysis Engine and CAS
+ Consumer, as specified by the name attributes of the CPE's
+ <literal><casProcessor></literal> elements.
+ <literal>annotator1</literal> and <literal>annotator2</literal> are the
+ annotator keys specified within the Aggregate AE Descriptor, and
+ <literal>myResourceKey</literal>, <literal>someResourceKey</literal>, and
+ <literal>otherResourceKey</literal> are the keys of the resource dependencies
+ declared in the individual annotator and CAS Consumer descriptors.</para>
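+
+ <para>For orientation, the CPE descriptor that uses such a file might be
+ outlined as in the following sketch. The file name
+ <literal>MyResourceManagerConfig.xml</literal> is hypothetical, the contents of
+ the other elements are elided, and the placement of the
+ <literal><resourceManagerConfiguration></literal> element after
+ <literal><cpeConfig></literal> is an assumption of this sketch. Note how
+ the <literal>name</literal> attributes supply the first segment of each
+ binding key:
+
+
+ <programlisting><![CDATA[<cpeDescription>
+  <collectionReader> ... </collectionReader>
+  <casProcessors dropCasOnException="true" casPoolSize="1"
+      processingUnitThreadCount="1">
+    <!-- matched by keys beginning with MyAE/ -->
+    <casProcessor deployment="integrated" name="MyAE"> ... </casProcessor>
+    <!-- matched by keys beginning with MyCasConsumer/ -->
+    <casProcessor deployment="integrated" name="MyCasConsumer"> ... </casProcessor>
+  </casProcessors>
+  <cpeConfig> ... </cpeConfig>
+  <!-- hypothetical file holding the resource manager configuration -->
+  <resourceManagerConfiguration href="MyResourceManagerConfig.xml"/>
+</cpeDescription>]]></programlisting></para>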
+
+ </section>
+
+ <section id="&tp;descriptor.example">
+ <title>Example CPE Descriptor</title>
+
+
+ <programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
+<cpeDescription>
+ <collectionReader>
+ <collectionIterator>
+ <descriptor>
+ <import location=
+ "../collection_reader/FileSystemCollectionReader.xml"/>
+ </descriptor>
+ </collectionIterator>
+ </collectionReader>
+ <casProcessors dropCasOnException="true" casPoolSize="1"
+ processingUnitThreadCount="1">
+ <casProcessor deployment="integrated"
+ name="Aggregate TAE - Name Recognizer and Person Title Annotator">
+ <descriptor>
+ <import location=
+ "../analysis_engine/NamesAndPersonTitles_TAE.xml"/>
+ </descriptor>
+ <deploymentParameters/>
+ <filter/>
+ <errorHandling>
+ <errorRateThreshold action="terminate" value="100/1000"/>
+ <maxConsecutiveRestarts action="terminate" value="30"/>
+ <timeout max="100000"/>
+ </errorHandling>
+ <checkpoint batch="1"/>
+ </casProcessor>
+ <casProcessor deployment="integrated" name="Annotation Printer">
+ <descriptor>
+ <import location="../cas_consumer/AnnotationPrinter.xml"/>
+ </descriptor>
+ <deploymentParameters/>
+ <filter/>
+ <errorHandling>
+ <errorRateThreshold action="terminate" value="100/1000"/>
+ <maxConsecutiveRestarts action="terminate" value="30"/>
+ <timeout max="100000"/>
+ </errorHandling>
+ <checkpoint batch="1"/>
+ </casProcessor>
+ </casProcessors>
+ <cpeConfig>
+ <numToProcess>1</numToProcess>
+ <deployAs>immediate</deployAs>
+ <checkpoint file="" time="3000"/>
+ <timerImpl/>
+ </cpeConfig>
+</cpeDescription>]]></programlisting>
+ </section>
+
+</chapter>
\ No newline at end of file
Added: uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/references.xml
URL: http://svn.apache.org/viewvc/uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/references.xml?rev=941739&view=auto
==============================================================================
--- uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/references.xml (added)
+++ uima/uimaj/branches/mavenAlign/uima-docbook-references/src/docbook/references.xml Thu May 6 14:01:56 2010
@@ -0,0 +1,35 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd">
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<book lang="en">
+ <title>UIMA References</title>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../target/docbook-shared/common_book_info_ibm_c.xml"/>
+
+ <toc/>
+
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.javadocs.xml"/>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.xml.component_descriptor.xml"/>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.xml.cpe_descriptor.xml"/>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.cas.xml"/>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.jcas.xml"/>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.pear.xml"/>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="ref.xmi.xml"/>
+</book>