Posted to commits@uima.apache.org by sc...@apache.org on 2008/08/28 23:28:16 UTC
svn commit: r689997 [14/32] - in /incubator/uima/uimaj/trunk/uima-docbooks:
./ src/ src/docbook/overview_and_setup/ src/docbook/references/
src/docbook/tools/ src/docbook/tutorials_and_users_guides/
src/docbook/uima/organization/ src/olink/references/
Modified: incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/references/ref.xml.cpe_descriptor.xml
URL: http://svn.apache.org/viewvc/incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/references/ref.xml.cpe_descriptor.xml?rev=689997&r1=689996&r2=689997&view=diff
==============================================================================
--- incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/references/ref.xml.cpe_descriptor.xml (original)
+++ incubator/uima/uimaj/trunk/uima-docbooks/src/docbook/references/ref.xml.cpe_descriptor.xml Thu Aug 28 14:28:14 2008
@@ -1,1368 +1,1368 @@
-<?xml version="1.0" encoding="UTF-8"?>
-<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
-"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
-<!ENTITY imgroot "../images/references/ref.xml.cpe_descriptor/">
-<!ENTITY tp "ugr.ref.xml.cpe_descriptor.">
-<!ENTITY % uimaents SYSTEM "../entities.ent" >
-%uimaents;
-]>
-<!--
-Licensed to the Apache Software Foundation (ASF) under one
-or more contributor license agreements. See the NOTICE file
-distributed with this work for additional information
-regarding copyright ownership. The ASF licenses this file
-to you under the Apache License, Version 2.0 (the
-"License"); you may not use this file except in compliance
-with the License. You may obtain a copy of the License at
-
- http://www.apache.org/licenses/LICENSE-2.0
-
-Unless required by applicable law or agreed to in writing,
-software distributed under the License is distributed on an
-"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-KIND, either express or implied. See the License for the
-specific language governing permissions and limitations
-under the License.
--->
-<chapter id="ugr.ref.xml.cpe_descriptor">
- <title>Collection Processing Engine Descriptor Reference</title>
- <titleabbrev>CPE Descriptor Reference</titleabbrev>
-
- <para>A UIMA <emphasis>Collection Processing Engine</emphasis> (CPE) is a combination
- of UIMA components assembled to analyze a collection of artifacts. A CPE is an
- instantiation of the UIMA <emphasis>Collection Processing Architecture</emphasis>,
- which defines the collection processing components, interfaces, and APIs. A CPE is
- executed by a UIMA framework component called the <emphasis>Collection Processing
- Manager</emphasis> (CPM), which provides a number of services for deploying CPEs,
- running CPEs, and handling errors.</para>
-
- <para>A CPE can be assembled programmatically within a Java application, or it can be
- assembled declaratively via a CPE configuration specification, called a CPE
- Descriptor. This chapter describes the format of the CPE Descriptor.</para>
-
- <para>Details about the CPE, including its function, sub-components, APIs, and related
- tools, can be found in <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.cpe"/>. Here we briefly summarize the CPE to define terms and
- provide context for the later sections that describe the CPE Descriptor.</para>
-
- <section id="&tp;overview">
- <title>CPE Overview</title>
-
- <figure id="&tp;overview.fig.runtime">
- <title>CPE Runtime Overview</title>
- <mediaobject>
- <imageobject>
- <imagedata width="5.8in" format="PNG"
- fileref="&imgroot;image002.png"/>
- </imageobject>
- <textobject><phrase>CPE Runtime Overview diagram</phrase></textobject>
- </mediaobject>
- </figure>
-
- <para>An illustration of the CPE runtime is shown in <xref
- linkend="&tp;overview.fig.runtime"/>. Some of the CPE components, such as the
- <emphasis>queues</emphasis> and <emphasis>processing pipelines</emphasis>, are
- internal to the CPE, but their behavior and deployment may be configured using the CPE
- Descriptor. Other CPE components, such as the <emphasis>Collection
- Reader</emphasis> and <emphasis>CAS Processors</emphasis>, are defined and
- configured externally from the CPE and then plugged in to the CPE to create the overall
- engine. The parts of a CPE are:
-
- <variablelist>
- <varlistentry>
- <term>Collection Reader</term>
- <listitem><para>understands the native data collection format and iterates
- over the collection producing subjects of analysis</para></listitem>
- </varlistentry>
-
- <varlistentry>
- <term>CAS Initializer<footnote><para>Deprecated</para></footnote>
- </term>
- <listitem><para>initializes a CAS with a subject of analysis</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Artifact Producer</term>
- <listitem><para>asynchronously pulls CASes from the Collection Reader,
- creates batches of CASes and puts them into the work queue</para></listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Work Queue</term>
- <listitem><para>shared queue containing batches of CASes queued by the Artifact
- Producer for analysis by Analysis Engines</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>B1-Bn</term>
- <listitem><para>individual batches containing 1 or more CASes</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>AE1-AEn</term>
- <listitem><para>Analysis Engines arranged by a CPE descriptor</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Processing Pipelines</term>
- <listitem><para>each pipeline runs in a separate thread and contains a
- replicated set of the Analysis Engines running in the defined sequence</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>Output Queue</term>
- <listitem><para>holds batches of CASes with analysis results intended for CAS
- Consumers</para></listitem>
- </varlistentry>
-
- <varlistentry>
- <term>CAS Consumers</term>
- <listitem><para>perform collection level analysis over the CASes and extract
- analysis results, e.g., creating indexes or databases</para></listitem>
- </varlistentry>
- </variablelist>
- </para>
- </section>
-
- <section id="&tp;notation">
- <title>Notation</title>
-
- <para>CPE Descriptors are XML files. This chapter uses an informal notation to specify
- the syntax of CPE Descriptors.</para>
-
- <para>The notation used in this chapter is:
-
- <itemizedlist><listitem><para>An ellipsis (...) inside an element body indicates
- that the substructure of that element has been omitted (to be described in another
- section of this chapter). An example of this would be:
-
-
- <programlisting>&lt;collectionReader&gt;
-...
-&lt;/collectionReader&gt;</programlisting></para>
- </listitem>
-
- <listitem><para>An ellipsis immediately after an element indicates that the
- element type may be repeated arbitrarily many times. For example:
-
-
- <programlisting>&lt;parameter&gt;[String]&lt;/parameter&gt;
-&lt;parameter&gt;[String]&lt;/parameter&gt;
-...</programlisting>
- indicates that there may be arbitrarily many parameter elements in this
- context.</para></listitem>
-
- <listitem><para>An ellipsis inside an element means details of the attributes
- associated with that element are defined later, e.g.:
-
- <programlisting>&lt;casProcessor ...&gt;</programlisting></para>
- </listitem>
-
- <listitem><para>Bracketed expressions (e.g. <literal>[String]</literal>)
- indicate the type of value that may be used at that location.</para></listitem>
-
- <listitem><para>A vertical bar, as in <literal>true|false</literal>, indicates
- alternatives. This can be applied to literal values, bracketed type names, and
- elements. </para></listitem></itemizedlist></para>
-
- <para>Which elements are optional and which are required is specified in prose, not in the
- syntax definition.</para>
-
- </section>
-
- <section id="&tp;imports">
- <title>Imports</title>
-
- <para>As of version 2.2, a CPE Descriptor can use the same <literal>import</literal> mechanism
- as other component descriptors. This allows referring to component
- descriptors using either relative paths (resolved relative to the location of the CPE descriptor)
- or the classpath/datapath. For details see <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.xml.component_descriptor"/>.</para>
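-
- <para>As a rough illustration only, a <literal>descriptor</literal> element that uses
- <literal>import</literal> might take either of the two forms sketched here. The file path and
- the dotted name are hypothetical placeholders, not files shipped with UIMA. The
- <literal>location</literal> form is resolved relative to the CPE descriptor, while the
- <literal>name</literal> form is looked up on the classpath/datapath:
-
- <programlisting><![CDATA[<!-- import by relative path (hypothetical file) -->
-<descriptor>
- <import location="../collection_reader/MyCollectionReader.xml"/>
-</descriptor>
-
-<!-- import by name (hypothetical dotted name) -->
-<descriptor>
- <import name="com.example.MyCollectionReader"/>
-</descriptor>]]></programlisting></para>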
-
- <para>The following older syntax is still supported, but <emphasis>not recommended</emphasis>:
-
- <programlisting><![CDATA[<descriptor>
- <include href="[URL or File]"/>
-</descriptor>]]></programlisting></para>
-
- <para>The <literal>[URL or File]</literal> value of the <literal>href</literal> attribute is a URL or a filename for the descriptor of the
- incorporated component. An attempt is first made to resolve the value as a URL.</para>
-
- <para>
- Relative paths in an <literal>include</literal> are resolved relative to the current working directory
- (NOT the CPE descriptor location as is the case for <literal>import</literal>).
- A filename relative to another directory can be specified using the <literal>CPM_HOME</literal>
- variable, e.g.,
- <programlisting>&lt;descriptor&gt;
- &lt;include href="${CPM_HOME}/desc_dir/descriptor.xml"/&gt;
-&lt;/descriptor&gt;</programlisting>
-
- In this case, the value for the <literal>CPM_HOME</literal> variable must be
- provided to the CPE by specifying it on the Java command line, e.g.,
-
- <programlisting>java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</programlisting>
-
- </para>
-
- </section>
-
- <section id="&tp;descriptor">
- <title>CPE Descriptor Overview</title>
-
- <para>A CPE Descriptor consists of information describing the following four main
- elements.</para>
-
- <orderedlist><listitem><para>The <emphasis>Collection Reader</emphasis>, which
- is responsible for gathering artifacts and initializing the Common Analysis
- Structure (CAS) used to support processing in the UIMA collection processing
- engine.</para></listitem>
-
- <listitem><para>The <emphasis>CAS Processors</emphasis>, responsible for
- analyzing individual artifacts, analyzing across artifacts, and extracting
- analysis results. CAS Processors include <emphasis>Analysis Engines</emphasis>
- and <emphasis>CAS Consumers</emphasis>.</para></listitem>
-
- <listitem><para>Operational parameters of the <emphasis>Collection Processing
- Manager</emphasis> (CPM), such as checkpoint frequency and deployment
- mode.</para></listitem>
-
- <listitem><para>Resource Manager Configuration (optional). </para></listitem>
- </orderedlist>
-
- <para>The CPE Descriptor has the following high level skeleton:
-
-
- <programlisting><![CDATA[<?xml version="1.0"?>
-<cpeDescription>
- <collectionReader>
-...
- </collectionReader>
- <casProcessors>
-...
- </casProcessors>
- <cpeConfig>
-...
- </cpeConfig>
- <resourceManagerConfiguration>
-...
- </resourceManagerConfiguration>
-</cpeDescription>]]></programlisting></para>
-
- <para>Details of each of the four main elements are described in the sections that
- follow.</para>
- </section>
- <section id="&tp;descriptor.collection_reader">
- <title>Collection Reader</title>
-
- <para>The <literal>&lt;collectionReader&gt;</literal> section identifies the
- Collection Reader and optional CAS Initializer that are to be used in the CPE. The
- Collection Reader is responsible for retrieval of artifacts from a collection
- outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2)
- is responsible for initializing the CAS with the artifact.</para>
-
- <para>A Collection Reader may initialize the CAS itself, in which case it does not
- require a CAS Initializer. This should be clearly specified in the documentation for
- the Collection Reader. Specifying a CAS Initializer for a Collection Reader that
- does not make use of a CAS Initializer will not cause an error, but the specified CAS
- Initializer will not be used.</para>
-
- <para>The complete structure of the <literal>&lt;collectionReader&gt;</literal>
- section is:
-
-
- <programlisting><![CDATA[<collectionReader>
- <collectionIterator>
- <descriptor>
- <import ...> | <include .../>
- </descriptor>
- <configurationParameterSettings>...</configurationParameterSettings>
- <sofaNameMappings>...</sofaNameMappings>
- </collectionIterator>
- <casInitializer>
- <descriptor>
- <import ...> | <include .../>
- </descriptor>
- <configurationParameterSettings>...</configurationParameterSettings>
- <sofaNameMappings>...</sofaNameMappings>
- </casInitializer>
-</collectionReader>]]></programlisting></para>
-
- <para>The <literal>&lt;collectionIterator&gt;</literal> identifies the
- descriptor for the Collection Reader, and the <literal>&lt;casInitializer&gt;</literal>
- identifies the descriptor for the CAS Initializer. The format and
- details of the Collection Reader and CAS Initializer descriptors are described in
- <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader"/>.
- The <literal>&lt;configurationParameterSettings&gt;</literal> and the
- <literal>&lt;sofaNameMappings&gt;</literal> elements are described in the next
- section.</para>
-
- <section id="&tp;descriptor.collection_reader.error_handling">
- <title>Error handling for Collection Readers</title>
-
- <para>The CPM will abort if the Collection Reader throws a large number of
- consecutive exceptions (default = 100). This default can be changed by using the
- Java initialization parameter <literal>-DMaxCRErrorThreshold
- xxx</literal>.</para>
- </section>
- </section>
-
- <section id="&tp;descriptor.cas_processors">
- <title>CAS Processors</title>
-
- <para>The <literal>&lt;casProcessors&gt;</literal> section identifies the
- components that perform the analysis on the input data, including CAS analysis
- (Analysis Engines) and analysis results extraction (CAS Consumers). The CAS
- Consumers may also perform collection level analysis, where the analysis is
- performed (or aggregated) over multiple CASes. The basic structure of the CAS
- Processors section is:
-
-
- <programlisting><![CDATA[<casProcessors
- dropCasOnException="true|false"
- casPoolSize="[Number]"
- processingUnitThreadCount="[Number]">
-
- <casProcessor ...>
- ...
- </casProcessor>
-
- <casProcessor ...>
- ...
- </casProcessor>
- ...
-</casProcessors>]]></programlisting></para>
-
- <para>The <literal>&lt;casProcessors&gt;</literal> section has two mandatory
- attributes and one optional attribute that configure the characteristics of the CAS
- Processor flow in the CPE. The first mandatory attribute is <literal>casPoolSize</literal>, which
- defines the fixed number of CAS instances that the CPM will create and use during
- processing. All CAS instances are maintained in a CAS Pool with check-in and
- check-out access. Each CAS is checked out of the CAS Pool by the Collection Reader
- and initialized with an initial subject of analysis. The CAS is checked back into the
- CAS Pool when it is completely processed, at the end of the processing chain. A larger
- CAS Pool size will result in more memory being used by the CPM. CAS objects can be large,
- so care should be taken to determine the optimum size of the CAS Pool, weighing memory
- use against performance.</para>
-
- <para>The second mandatory <literal>&lt;casProcessors&gt;</literal> attribute
- is <literal>processingUnitThreadCount</literal>, which specifies the number of
- replicated <emphasis>Processing Pipelines</emphasis>. Each Processing
- Pipeline runs in its own thread. The CPM takes CASes from the work queue and submits
- each CAS to one of the Processing Pipelines for analysis. A Processing Pipeline
- contains one or more Analysis Engines invoked in a given sequence. If more than one
- Processing Pipeline is specified, the CPM replicates instances of each Analysis
- Engine defined in the CPE descriptor. Each Processing Pipeline thread runs
- independently, consuming CASes from the work queue and depositing CASes with analysis
- results onto the output queue. On multiprocessor machines, multiple Processing
- Pipelines can run in parallel, improving overall throughput of the CPM.</para>
- <note><para>The number of Processing Pipelines should be equal to or greater than CAS
- Pool size. </para></note>
-
- <para>Elements in the pipeline (each represented by a <literal>&lt;casProcessor&gt;</literal> element)
- may indicate that they do not permit multiple deployment in their Analysis Engine
- descriptor. If so, even though multiple pipelines are being used, all CASes passing
- through the pipelines will be routed through one instance of these marked Engines.
- </para>
-
- <para>The final, optional, <literal>&lt;casProcessors&gt;</literal> attribute is
- <literal>dropCasOnException</literal>. It defines a policy that determines what
- happens with the CAS when an exception happens during processing. If the value of this
- attribute is set to true and an exception happens, the CPM will notify all registered
- listeners of the exception (see <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.cpe.using_listeners"/>), clear the CAS and check the CAS
- back into the CAS Pool so that it can be re-used. The presumption is that an exception
- may leave the CAS in an inconsistent state and therefore that CAS should not be allowed
- to move through the processing chain. When this attribute is omitted the CPM's
- default is the same as specifying
- <literal>dropCasOnException="false"</literal>.</para>
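-
- <para>Putting the three attributes together, an opening tag might look like the following
- sketch. The values are illustrative assumptions, not recommendations; suitable numbers depend
- on available memory and processors:
-
- <programlisting><![CDATA[<casProcessors dropCasOnException="true"
- casPoolSize="2"
- processingUnitThreadCount="2">
- ...
-</casProcessors>]]></programlisting></para>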
-
- <section id="&tp;descriptor.cas_processors.individual">
- <title>Specifying an Individual CAS Processor</title>
-
- <para>The CAS Processors that make up the Processing Pipeline and the CAS Consumer
- pipeline are specified with the <literal>&lt;casProcessor&gt;</literal>
- element, which appears within the <literal>&lt;casProcessors&gt;</literal>
- element. It may appear multiple times, once for each CAS Processor specified for
- this CPE.</para>
-
- <para>The order of the <literal>&lt;casProcessor&gt;</literal> elements within
- the <literal>&lt;casProcessors&gt;</literal> section specifies the order in
- which the CAS Processors will run. Although CAS Consumers are usually put at the end
- of the pipeline, they need not be. Also, Aggregate Analysis Engines may include CAS
- Consumers.</para>
-
- <para>The overall format of the <literal>&lt;casProcessor&gt;</literal> element
- is:
-
-
- <programlisting><![CDATA[<casProcessor deployment="local|remote|integrated" name="[String]" >
- <descriptor>
- <import ...> | <include .../>
- </descriptor>
- <configurationParameterSettings>...</configurationParameterSettings>
- <sofaNameMappings>...</sofaNameMappings>
- <runInSeparateProcess>...</runInSeparateProcess>
- <deploymentParameters>...</deploymentParameters>
- <filter/>
- <errorHandling>...</errorHandling>
- <checkpoint batch="[Number]"/>
-</casProcessor>]]></programlisting></para>
-
- <para>The <literal>&lt;casProcessor&gt;</literal> element has two mandatory
- attributes, <literal>deployment</literal> and <literal>name</literal>. The
- mandatory <literal>name</literal> attribute specifies a unique string
- identifying the CAS Processor.</para>
-
- <para>The mandatory <literal>deployment</literal> attribute specifies the CAS
- Processor deployment mode. Currently, three deployment options are supported:
-
- <variablelist>
- <varlistentry>
- <term>integrated</term>
- <listitem><para>indicates <emphasis>integrated</emphasis> deployment
- of the CAS Processor. The CPM deploys and collocates the CAS Processor in the
- same process space as the CPM. This type of deployment is recommended to
- increase the performance of the CPE. However, it is NOT recommended to
- deploy annotators containing JNI this way. Such CAS Processors may cause a
- fatal exception and force the JVM to exit without cleanup (bringing down the
- CPM). Any UIMA SDK compliant pure Java CAS Processors may be safely deployed
- this way.</para>
- <para>The descriptor for an integrated deployment can, in fact, be a remote
- service descriptor. When used this way, however, the CPM error recovery
- options (see below) operate in the integrated mode, which means that many
- of the retry options are not available.</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>remote</term>
- <listitem><para>indicates <emphasis>non-managed</emphasis>
- deployment of the CAS Processor. The CAS Processor descriptor referenced
- in the <literal>&lt;descriptor&gt;</literal> element must be a Vinci
- <emphasis>Service Client Descriptor</emphasis>, which identifies a
- remotely deployed CAS Processor service (see <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application.remote_services"/>). The CPM
- assumes that the CAS Processor is already running as a remote service and
- will connect to it using the URI provided in the client service descriptor.
- The lifecycle of a remotely deployed CAS Processor is not managed by the CPM,
- so appropriate infrastructure should be in place to start/restart such CAS
- Processors when necessary. This deployment provides fault isolation and
- is implementation (i.e., programming language) neutral.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>local</term>
- <listitem><para>indicates <emphasis>managed</emphasis> deployment of
- the CAS Processor. The CAS Processor descriptor referenced in the
- <literal>&lt;descriptor&gt;</literal> element must be a Vinci
- <emphasis>Service Deployment Descriptor</emphasis>, which configures
- a CAS Processor for deployment as a Vinci service (see <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application.remote_services"/>). The CPM
- deploys the CAS Processor in a separate process and manages the life cycle
- (start/stop) of the CAS Processor. Communication between the CPM and the
- CAS Processor is done with Vinci. When the CPM completes processing, the
- process containing the CAS Processor is terminated. This deployment mode
- insulates the CPM from the CAS Processor, creating a more robust deployment
- at the cost of a small communication overhead. On multiprocessor machines,
- the separate processes may run concurrently and improve overall
- throughput.</para></listitem>
- </varlistentry>
- </variablelist></para>
-
- <para>A number of elements may appear within the
- <literal>&lt;casProcessor&gt;</literal> element.</para>
-
- <section id="&tp;descriptor.cas_processors.individual.descriptor">
- <title>&lt;descriptor&gt; Element</title>
-
- <para>The <literal>&lt;descriptor&gt;</literal> element is mandatory. It
- identifies the descriptor for the referenced CAS Processor using the syntax
- described in <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.xml.component_descriptor.aes"/>.
-
- <itemizedlist spacing="compact"><listitem><para>For
- <emphasis><literal>remote</literal></emphasis> CAS Processors, the
- referenced descriptor must be a Vinci <emphasis>Service Client
- Descriptor</emphasis>, which identifies a remotely deployed CAS Processor
- service.</para></listitem>
-
- <listitem><para>For <emphasis>local</emphasis> CAS Processors, the
- referenced descriptor must be a Vinci <emphasis>Service Deployment
- Descriptor</emphasis>.</para></listitem>
-
- <listitem><para>For <emphasis>integrated</emphasis> CAS Processors,
- the referenced descriptor must be an Analysis Engine Descriptor
- (primitive or aggregate). </para></listitem></itemizedlist> </para>
-
- <para>See <olink targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.application.remote_services"/> for more
- information on creating these descriptors and deploying services.</para>
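-
- <para>As a sketch of how the pieces fit together for the integrated case, the following
- fragment references an Analysis Engine descriptor by relative path. The processor name and
- the file path are hypothetical, and the remaining child elements are elided using the
- notation of this chapter:
-
- <programlisting><![CDATA[<casProcessor deployment="integrated" name="Hypothetical Annotator">
- <descriptor>
- <import location="descriptors/HypotheticalAnnotator.xml"/>
- </descriptor>
- ...
-</casProcessor>]]></programlisting></para>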
-
- </section>
-
- <section
- id="&tp;descriptor.cas_processors.individual.configuration_parameter_settings">
- <title>&lt;configurationParameterSettings&gt; Element</title>
-
- <para>This element provides a way to override the contained Analysis
- Engine's parameter settings. Any entry specified here must already be
- defined; values specified replace the corresponding values for each
- parameter. <emphasis role="bold-italic">For CAS Processors, this mechanism
- is only available when they are deployed in <quote>integrated</quote>
- mode.</emphasis> For Collection Readers and Initializers, it is always
- available.</para>
-
- <para>The content of this element is identical to that used in the component descriptor for
- specifying parameter settings (in the case where no parameter groups are
- specified)<footnote><para>An earlier UIMA version required these to have a
- suffix of <quote>_p</quote>, e.g., <quote>string_p</quote>. This is no
- longer required, but the older format is still accepted for backward
- compatibility.</para></footnote>. Here is an example:
-
-
- <programlisting><![CDATA[<configurationParameterSettings>
- <nameValuePair>
- <name>CivilianTitles</name>
- <value>
- <array>
- <string>Mr.</string>
- <string>Ms.</string>
- <string>Mrs.</string>
- <string>Dr.</string>
- </array>
- </value>
- </nameValuePair>
- ...
-</configurationParameterSettings>]]></programlisting></para>
-
- </section>
-
- <section
- id="&tp;descriptor.cas_processors.individual.sofa_name_mappings">
- <title>&lt;sofaNameMappings&gt; Element</title>
-
- <para>This optional element maps the Sofa names defined in the
- component, or the default Sofa name (if the component does not declare any Sofa
- names), to Sofa names used in the CPE. The form of this element is:
-
-
- <programlisting>&lt;sofaNameMappings&gt;
- &lt;sofaNameMapping cpeSofaName="a_CPE_name"
- componentSofaName="a_component_Name"/&gt;
- ...
-&lt;/sofaNameMappings&gt;</programlisting></para>
-
- <para>There can be any number of
- <literal>&lt;sofaNameMapping&gt;</literal> elements contained in the
- <literal>&lt;sofaNameMappings&gt;</literal> element. The
- <literal>componentSofaName</literal> attribute is optional; leave it out to
- specify a mapping for the <literal>_InitialView</literal>, that is, for
- Single-View components.</para>
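-
- <para>For instance, the two cases might be written as in the following sketch. The Sofa
- names are made up for illustration. The first mapping omits
- <literal>componentSofaName</literal> and so applies to the component's
- <literal>_InitialView</literal>; the second maps a Sofa name that the component declares:
-
- <programlisting><![CDATA[<!-- Single-View component: its _InitialView is bound to a CPE Sofa -->
-<sofaNameMappings>
- <sofaNameMapping cpeSofaName="EnglishDocument"/>
-</sofaNameMappings>
-
-<!-- component that declares a Sofa named "translation" -->
-<sofaNameMappings>
- <sofaNameMapping cpeSofaName="GermanDocument"
- componentSofaName="translation"/>
-</sofaNameMappings>]]></programlisting></para>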
-
- </section>
-
- <section id="&tp;descriptor.cas_processors.run_in_separate_process">
- <title>&lt;runInSeparateProcess&gt; Element</title>
-
- <para>The <literal>&lt;runInSeparateProcess&gt;</literal> element is
- mandatory for <literal>local</literal> CAS Processors, but should not appear
- for <literal>remote</literal> or <literal>integrated</literal> CAS
- Processors. It enables the CPM to create external processes using the provided
- runtime environment. Applications launched this way communicate with the CPM
- using the Vinci protocol and connectivity is enabled by a local instance of the
- VNS that the CPM manages. Since communication is based on Vinci, the application
- need not be implemented in Java. Any language for which Vinci provides support
- may be used to create an application, and the CPM will seamlessly communicate
- with it. The overall structure of this element is:
-
-
- <programlisting><![CDATA[<runInSeparateProcess>
- <exec dir="[String]" executable="[String]">
- <env key="[String]" value="[String]"/>
- ...
- <arg>[String]</arg>
- ...
- </exec>
-</runInSeparateProcess>]]></programlisting></para>
-
- <para>The <literal>&lt;exec&gt;</literal> element provides information
- about how to execute the referenced CAS Processor. Two attributes are defined
- for the <literal>&lt;exec&gt;</literal> element. The
- <literal>dir</literal> attribute is currently not used; it is reserved
- for future functionality. The <literal>executable</literal> attribute
- specifies the actual Vinci service executable that will be run by the CPM, e.g.,
- <literal>java</literal>, a batch script, an application (.exe), etc. The
- executable must be specified with a fully qualified path, or be found in the
- <literal>PATH</literal> of the CPM.</para>
-
- <para>The <literal>&lt;exec&gt;</literal> element has two elements within it
- that define parameters used to construct the command line for executing the CAS
- Processor. These elements must be listed in the order in which they should be
- defined for the CAS Processor.</para>
-
- <para>The optional <literal>&lt;env&gt;</literal> element is used to set an
- environment variable. The variable <literal>key</literal> will be set to
- <literal>value</literal>. For example,
-
-
- <programlisting>&lt;env key="CLASSPATH" value="C:Javalib"/&gt;</programlisting>
- will set the environment variable <literal>CLASSPATH</literal> to the value
- <literal>C:Javalib</literal>. The <literal>&lt;env&gt;</literal>
- element may be repeated to set multiple environment variables. All of the
- key/value pairs will be added to the environment by the CPM prior to launching the
- executable.</para>
- <note><para>The CPM actually adds ALL system environment variables when it
- launches the program. It queries the Operating System for its current system
- variables and one by one adds them to the program's process
- configuration.</para></note>
-
- <para>The <literal>&lt;arg&gt;</literal> element is used to specify arbitrary
- string arguments that will appear on the command line when the CPM runs the
- command specified in the <literal>executable</literal> attribute.</para>
-
- <para>For example, the following would be used to invoke the UIMA Java
- implementation of the Vinci service wrapper on a Java CAS Processor:
-
-
- <programlisting><![CDATA[<runInSeparateProcess>
- <exec executable="java">
- <arg>-DVNS_HOST=localhost</arg>
- <arg>-DVNS_PORT=9099</arg>
- <arg>org.apache.uima.reference_impl.analysis_engine.service.
-vinci.VinciAnalysisEngineService_impl</arg>
- <arg>C:uimadescdeployCasProcessor.xml</arg>
- </exec>
-</runInSeparateProcess>]]></programlisting></para>
-
- <para>This will cause the CPM to run the following command line when starting the
- CAS Processor:
-
-
- <programlisting>java -DVNS_HOST=localhost -DVNS_PORT=9099
- org.apache.uima.reference_impl.analysis_engine.service.vinci.\\
- VinciAnalysisEngineService_impl
- C:uimadescdeployCasProcessor.xml</programlisting></para>
-
- <para>The first argument specifies that the Vinci Naming Service is running on the
- <literal>localhost</literal>. The second argument specifies that the Vinci
- Naming Service port number is <literal>9099</literal>. The third argument
- (split over 2 lines in this documentation)
- identifies the UIMA implementation of the Vinci service wrapper. This class
- contains the <literal>main</literal> method that will execute. That main
- method in turn takes a single argument – the filename for the CAS Processor
- service deployment descriptor. Thus the last argument identifies the Vinci
- service deployment descriptor file for the CAS Processor. Since this is the same
- descriptor file specified earlier in the
- <literal>&lt;descriptor&gt;</literal> element, the string
- <literal>${descriptor}</literal> can be used to refer to the descriptor,
- e.g.:
-
-
- <programlisting>&lt;arg&gt;${descriptor}&lt;/arg&gt;</programlisting></para>
-
- <para>The CPM will expand this out to the service deployment descriptor file
- referenced in the <literal>&lt;descriptor&gt;</literal> element.</para>
-
- </section>
-
- <section
- id="&tp;descriptor.cas_processors.individual.deployment_parameters">
- <title>&lt;deploymentParameters&gt; Element</title>
-
- <para>The <literal>&lt;deploymentParameters&gt;</literal> element defines
- a number of deployment parameters that control how the CPM will interact with the
- CAS Processor. This element has the following overall form:
-
-
- <programlisting>&lt;deploymentParameters&gt;
- &lt;parameter name="[String]" value="..." type="string|integer" /&gt;
- ...
-&lt;/deploymentParameters&gt;</programlisting></para>
-
- <para>The <literal>name</literal> attribute identifies the parameter, the
- <literal>value</literal> attribute specifies the value that will be assigned
- to the parameter, and the <literal>type</literal> attribute indicates the
- type of the parameter, either <literal>string</literal> or
- <literal>integer</literal>. The available parameters include:
-
- <variablelist>
-
- <varlistentry>
- <term>service-access</term>
- <listitem><para>string parameter whose value must be
- <quote>exclusive</quote>, if present. This parameter is only
- effective for remote deployments. It causes the Vinci service
- connections to be preallocated and dedicated, one service instance per
- pipeline. It is only relevant for non-integrated deployment modes. If
- fewer service instances are available (and alive, that is,
- responding to a <quote>ping</quote> request) than there are pipelines,
- the number of pipelines (the number of concurrent threads) is reduced to
- match the number of available instances. If the parameter is not specified, the VNS is
- queried each time a service is needed, and a <quote>random</quote>
- instance is assigned from the pool of available instances. If a service
- dies during processing, the CPM will use its normal error handling
- procedures to attempt to reconnect. The number of attempts is specified
- in the CPE descriptor for each CAS Processor using the
- <literal>&lt;maxConsecutiveRestarts value="10"
- action="kill-pipeline"
- waitTimeBetweenRetries="50"/&gt;</literal> XML element. The
- <quote>value</quote> attribute is the number of reconnection tries;
- the <quote>action</quote> attribute says what to do if the retries exceed the
- limit. The <quote>kill-pipeline</quote> action stops the pipeline
- that was associated with the failing service (other pipelines will
- continue to work). The CAS in process within a killed pipeline will be
- dropped. These events are communicated to the application using the
- normal event listener mechanism. The
- <literal>waitTimeBetweenRetries</literal> attribute says how many
- milliseconds to wait between attempts to reconnect.</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>vnsHost</term>
- <listitem><para>(Deprecated) string parameter specifying the VNS host,
- e.g., <literal>localhost</literal> for local CAS Processors, host
- name or IP address of VNS host for remote CAS Processors. This parameter is
- deprecated; use the parameter specification instead inside the Vinci
- <emphasis>Service Client Descriptor</emphasis>, if needed. It is
- ignored for integrated and local deployments. If present, for remote
- deployments, it specifies the VNS Host to use, unless that is specified in
- the Vinci <emphasis>Service Client Descriptor</emphasis>.</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>vnsPort</term>
- <listitem><para>(Deprecated) integer parameter specifying the VNS port
- number. This parameter is deprecated; use the parameter specification
- instead inside the Vinci <emphasis>Service Client
- Descriptor,</emphasis> if needed. It is ignored for integrated and
- local deployments. If present, for remote deployments, it specifies the
- VNS Port number to use, unless that is specified in the Vinci
- <emphasis>Service Client Descriptor.</emphasis></para>
- </listitem>
- </varlistentry>
- </variablelist></para>
-
- <para>For example, the following parameters might be used with a CAS Processor
- deployed in remote mode:
-
-
- <programlisting><deploymentParameters>
- <parameter name="service-access" value="exclusive" type="string"/>
-</deploymentParameters></programlisting></para>
-
- </section>
-
- <section id="&tp;descriptor.cas_processors.individual.filter">
- <title><filter> Element</title>
-
- <para>The <filter> element is a required element but currently should be
- left empty. This element is reserved for future use.</para>
-
- </section>
-
- <section id="&tp;descriptor.cas_processors.individual.error_handling">
- <title><errorHandling> Element</title>
-
- <para>The mandatory <literal><errorHandling></literal> element
- defines error and restart policies for the CAS Processor. Each CAS Processor may
- define different actions in the event of errors and restarts. The CPM monitors
- and logs errant behaviors and attempts to recover the component based on the
- policies specified in this element.</para>
-
- <para>There are two kinds of faults:
-
- <orderedlist><listitem><para>One kind only occurs with non-integrated CAS
- Processors – this fault is either a timeout attempting to launch or
- connect to the non-integrated component, or some other kind of
- connection-related exception (for instance, the network connection might time out or get
- reset).</para></listitem>
-
- <listitem><para>The other kind happens when the CAS Processor component (an
- Annotator, for example) throws any kind of exception. This kind may occur
- with any kind of deployment, integrated or not. </para></listitem>
- </orderedlist></para>
-
- <para>The <literal><errorHandling></literal> element has specifications for each of these kinds of
- faults. The format of this element is:
-
-
- <programlisting><![CDATA[<errorHandling>
- <maxConsecutiveRestarts action="continue|disable|terminate"
- value="[Number]"/>
- <errorRateThreshold action="continue|disable|terminate" value="[Rate]"/>
- <timeout max="[Number]"/>
-</errorHandling>]]></programlisting></para>
-
- <para>The mandatory <literal><maxConsecutiveRestarts></literal>
- element applies only to faults of the first kind, and therefore, only applies to
- non-integrated deployments. If such a fault occurs, a retry is attempted, up to
- the number of times given by <literal>value="[Number]"</literal>. Each retry resets the
- connection (if one was made) and attempts to reconnect and perhaps re-launch
- (see below for details). The original CAS (not a partially updated one) is sent to
- the CAS Processor as part of the retry, once the deployed component has been
- successfully restarted or reconnected to.</para>
-
- <para>The <literal>action</literal> attribute specifies the action to take
- when the threshold specified by the <literal>value="[Number]"</literal> is
- exceeded. The possible actions are:
-
- <variablelist>
- <varlistentry>
- <term>continue</term>
- <listitem><para>skip any further processing for this CAS by this CAS
- Processor, and pass the CAS to the next CAS Processor in the Pipeline.
- </para>
- <para>The <quote>restart</quote> is still performed, because the CAS Processor is needed
- for the next CAS.</para>
-
- <para>If <literal>dropCasOnException="true"</literal> is set, the CPM
- will NOT pass the CAS to the next CAS Processor in the chain. Instead, the
- CPM will abort processing of this CAS, release the CAS back to the CAS
- Pool and will process the next CAS in the queue.</para>
-
- <para>The counter counting the restarts toward the threshold is only
- reset after a CAS is successfully processed.</para></listitem>
- </varlistentry>
-
- <varlistentry>
- <term>disable</term>
- <listitem><para>the current CAS is handled just as in the
- <literal>continue</literal> case, but in addition, the CAS Processor
- is marked so that its <emphasis>process()</emphasis> method will not be
- called again (i.e., it will be <quote>skipped</quote> for future
- CASes)</para></listitem>
- </varlistentry>
-
- <varlistentry>
- <term>terminate</term>
- <listitem><para>the CPM will terminate all processing and exit.</para>
- </listitem>
- </varlistentry>
- </variablelist></para>
-
- <para>The definition of an error for the
- <literal><maxConsecutiveRestarts></literal> element differs
- slightly for each of the three CAS Processor deployment modes:
- <variablelist>
- <varlistentry>
- <term>local</term>
- <listitem><para>Local CAS Processors experience two general error
- types:
- <itemizedlist>
- <listitem><para>launch errors – errors associated with
- launching a process</para></listitem>
- <listitem><para>processing errors – errors associated with
- sending Vinci commands to the process</para></listitem>
- </itemizedlist></para>
-
- <para>A launch error is defined by a failure of the process to
- successfully register with the local VNS within a default time window.
- The current timeout is 15 minutes. Multiple local CAS Processors are
- launched sequentially, with a subsequent processor launched
- immediately after its previous processor successfully registers
- with the VNS.</para>
-
- <para>A processing error is detected if a connection to the CAS Processor
- is lost or if the processing time exceeds a specified timeout
- value.</para>
-
- <para>For local CAS Processors, the
- <maxConsecutiveRestarts> element specifies the number of
- consecutive attempts made to launch the CAS Processor at CPM startup or
- after the CPM has lost a connection to the CAS Processor.</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>remote</term>
- <listitem><para>For remote CAS Processors, the
- <maxConsecutiveRestarts> element applies to errors from
- sending Vinci commands. An error is detected if a connection to the CAS
- Processor is lost, or if the processing time exceeds the timeout value
- specified in the <timeout> element (see below).</para>
- </listitem>
- </varlistentry>
-
- <varlistentry>
- <term>integrated</term>
- <listitem><para>Although mandatory, the
- <maxConsecutiveRestarts> element is NOT used for integrated CAS
- Processors, because integrated CAS Processors are not
- re-instantiated or restarted on exceptions. The CPM ignores this setting
- for integrated CAS Processors, but the element must still be present. A future version
- of the CPM will make this element mandatory for remote and local CAS
- Processors only.</para></listitem>
- </varlistentry>
-
- </variablelist></para>
-
- <para>The mandatory <literal><errorRateThreshold></literal> element
- is used for all faults – both those above, and exceptions thrown by the CAS
- Processor itself. It specifies the number of retries for exceptions thrown by
- the CAS Processor itself, a maximum error rate, and the corresponding action to
- take when this rate is exceeded. The <literal>value</literal> attribute
- specifies the error rate in terms of errors per sample size in the form
- <quote><literal>N/M</literal></quote>, where <literal>N</literal> is the
- number of errors and <literal>M</literal> is the sample size, defined in terms
- of the number of documents.</para>
-
- <para>The first number is also used to indicate the maximum number of retries. If
- this number is less than the <literal><maxConsecutiveRestarts
- value="[Number]"></literal> value, it takes precedence, reducing the number of
- <quote>restarts</quote> attempted. A retry is done only if
- <literal>dropCasOnException</literal> is false. If it is set to true, no retry
- occurs, but the error is counted.</para>
-
- <para>When the error rate is exceeded (more than <literal>N</literal> errors
- counted within a sample of <literal>M</literal> documents), the action
- specified by the <literal>action</literal> attribute is taken. The possible
- actions and their meaning are the same as described above for the
- <literal><maxConsecutiveRestarts></literal> element:
- <itemizedlist spacing="compact">
- <listitem><para><literal>continue</literal></para></listitem>
- <listitem><para><literal>disable</literal></para></listitem>
- <listitem><para><literal>terminate</literal></para></listitem>
- </itemizedlist></para>
-
- <para>The <literal>dropCasOnException="true"</literal> attribute of the
- <literal><casProcessors></literal> element modifies the action
- taken for continue and disable, in the same manner as above. For example:
-
-
- <programlisting><errorRateThreshold value="3/1000" action="disable"/></programlisting>
- specifies that each error thrown by the CAS Processor itself will be retried up to
- 3 times (if <literal>dropCasOnException</literal> is false) and the CAS
- Processor will be disabled if the error rate exceeds 3 errors in 1000
- documents.</para>
-
- <para>If a document causes an error and the error rate threshold for the CAS
- Processor is not exceeded, the CPM increments the CAS Processor's error
- count and retries processing that document (if
- <literal>dropCasOnException</literal> is false). The retry means that the
- CPM calls the CAS Processor's process() method again, passing in as an
- argument the same CAS that previously caused an exception.</para>
- <note><para>The CPM does not attempt to roll back any partial changes that may have
- been applied to the CAS in the previous process() call.</para></note>
-
- <para>Errors are accumulated across documents. For example, assume the error
- rate threshold is <literal>3/1000</literal>. The same document may fail three
- times before finally succeeding on the fourth try, but the error count is now 3. If
- one more error occurs within the current sample of 1000 documents, the error rate
- threshold will be exceeded and the specified action will be taken. If no more
- errors occur within the current sample, the error counter is reset to 0 for the
- next sample of 1000 documents.</para>
-
- <para>The <literal><timeout></literal> element is a mandatory element.
- Although mandatory for all CAS Processors, this element is only relevant for
- local and remote CAS Processors. For integrated CAS Processors, this element is
- ignored. In the current CPM implementation the integrated CAS Processor
- process() method is not subject to timeouts.</para>
-
- <para>The <literal>max</literal> attribute specifies the maximum amount of
- time in milliseconds the CPM will wait for a process() method to complete. When
- this limit is exceeded, the CPM will generate an exception and will treat this as an error
- subject to the threshold defined in the
- <literal><errorRateThreshold></literal> element above, including
- doing retries.</para>
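-
- <para>As an illustration of how the three elements combine, the following
- sketch shows an error handling block for a hypothetical remote CAS
- Processor. The numbers are assumptions chosen for illustration, not
- recommended settings: a limit of 3 consecutive reconnection attempts, an error
- rate of 10 errors per sample of 1000 documents, and a 60-second limit on each
- process() call.
-
-
- <programlisting><![CDATA[<errorHandling>
-  <!-- connection faults: terminate the CPM once the limit of 3
-       consecutive restarts is exceeded -->
-  <maxConsecutiveRestarts action="terminate" value="3"/>
-  <!-- all faults: once the rate of 10 errors per 1000 documents is
-       exceeded, skip this CAS Processor for the failing CAS -->
-  <errorRateThreshold action="continue" value="10/1000"/>
-  <!-- a process() call that runs longer than 60000 ms counts as an error -->
-  <timeout max="60000"/>
-</errorHandling>]]></programlisting></para>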
-
- <section
- id="&tp;descriptor.cas_processors.individual.error_handling.timeout_retry_action">
- <title>Retry action taken on a timeout</title>
-
- <para>The action taken depends on whether the CAS Processor is local (managed)
- or remote (unmanaged). Local CAS Processors (which are services) are killed
- and restarted, and a new connection to them is established. For remote CAS
- Processors, the connection to them is dropped, and a new connection is
- reestablished (which may actually connect to a different instance of the
- remote services, if it has multiple instances).</para>
- </section>
- </section>
-
- <section id="&tp;descriptor.cas_processors.individual.checkpoint">
- <title><checkpoint> Element</title>
-
- <para>The <literal><checkpoint></literal> element is an optional
- element used to improve the performance of CAS Consumers. It has a single
- attribute, <literal>batch</literal>, which specifies the number of CASes in a
- batch, e.g.:
-
-
- <programlisting><checkpoint batch="1000"/></programlisting></para>
-
- <para>This sets the batch size to 1000 CASes. The batch size is the interval used to mark a
- point in processing requiring special handling. The CAS Processor's
- <literal>batchProcessComplete()</literal> method will be called by the CPM
- when this mark is reached so that the processor can take appropriate action. This
- mark could be used as a mechanism to buffer up results in CAS Consumers and perform
- time-consuming operations, such as check-pointing, that should not be done on a
- per-document basis.</para>
-
- </section>
- </section>
- </section>
-
- <section id="&tp;descriptor.operational_parameters">
- <title>CPE Operational Parameters</title>
-
- <para>The parameters for configuring the overall CPE and CPM are specified in the
- <literal><cpeConfig></literal> section. The overall format of this
- section is:
-
-
- <programlisting><![CDATA[<cpeConfig>
- <startAt>[NumberOrID]</startAt>
-
- <numToProcess>[Number]</numToProcess>
-
- <outputQueue dequeueTimeout="[Number]" queueClass="[ClassName]" />
-
- <checkpoint file="[File]" time="[Number]" batch="[Number]"/>
-
- <timerImpl>[ClassName]</timerImpl>
-
- <deployAs>vinciService|interactive|immediate|single-threaded
- </deployAs>
-
-</cpeConfig>]]></programlisting></para>
-
- <para>This section of the CPE descriptor allows for defining the starting entity, the
- number of entities to process, a checkpoint file and frequency, a pluggable timer, an
- optional output queue implementation, and finally a mode of operation. The mode of
- operation determines how the CPM interacts with users and other systems.</para>
-
- <para>The <literal><startAt></literal> element is an optional element. It
- defines the starting entity in the collection at which the CPM should start
- processing.</para>
-
- <para>The implementation in the CPM passes this argument to the Collection Reader
- as the value of the parameter <quote><literal>startNumber</literal></quote>.
- The CPM does not do anything else with this parameter; in particular, the CPM has no
- ability to skip to a specific document - that function, if available, is only provided
- by a particular Collection Reader implementation.</para>
-
- <para>If the <literal><startAt></literal> element is used, the Collection
- Reader descriptor must define a single-valued configuration parameter with the
- name <literal>startNumber</literal>. It can declare this value to be of any type;
- the value passed in this XML element must be convertible to that type.</para>
-
- <para>A typical use is to declare this to be an integer type, and to pass the sequential
- document number where processing should start. An alternative implementation
- might take a specific document ID; the collection reader could search through its
- collection until it reaches this ID and then start there.</para>
-
- <para>This parameter will only make sense if the particular collection reader is
- implemented to use the <literal>startNumber</literal> configuration
- parameter.</para>
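-
- <para>A minimal sketch of the two pieces that must agree, assuming a
- Collection Reader that interprets <literal>startNumber</literal> as a
- sequential document number (the value 101 and the choice of an optional
- Integer parameter are illustrative assumptions). The CPE descriptor supplies
- the value:
-
-
- <programlisting><![CDATA[<cpeConfig>
-  <startAt>101</startAt>
-  ...
-</cpeConfig>]]></programlisting>
-
- and the Collection Reader descriptor declares the matching single-valued
- configuration parameter:
-
-
- <programlisting><![CDATA[<configurationParameter>
-  <name>startNumber</name>
-  <type>Integer</type>
-  <multiValued>false</multiValued>
-  <mandatory>false</mandatory>
-</configurationParameter>]]></programlisting></para>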
-
- <para>The <literal><numToProcess></literal> element is an optional
- element. It specifies the total number of entities to process. Use -1 to indicate ALL.
- If not defined, the number of entities to process will be taken from the Collection
- Reader configuration. If present, this value overrides the Collection Reader
- configuration.</para>
-
- <para>The <literal><outputQueue></literal> element is an optional element.
- It enables plugging in a custom implementation for the Output Queue. When omitted,
- the CPM will use a default output queue that is based on a First-in First-out (FIFO)
- model.</para>
-
- <para>The UIMA SDK provides a second implementation for the Output Queue that can be
- plugged in to the CPM, named <quote>
- <literal>org.apache.uima.collection.impl.cpm.engine.SequencedQueue</literal>
- </quote>.</para>
-
- <para>This implementation supports handling very large documents that are split into
- <quote>chunks</quote>; it provides a delivery mechanism that ensures the
- sequential order of the chunks using information carried in the CAS metadata. This
- metadata, which is required for this implementation to work correctly, must be added
- as an instance of a Feature Structure of type
- <literal>org.apache.es.tt.DocumentMetaData</literal> and referred to by an
- additional feature named <literal>esDocumentMetaData</literal> in the special
- instance of <literal>uima.tcas.DocumentAnnotation</literal> that is
- associated with the CAS. This is usually done by the Collection Reader; the instance
- contains the following features:
-
- <variablelist>
- <varlistentry>
- <term>sequenceNumber</term>
- <listitem><para>[Number] the sequential number of a chunk, starting at 1. If
- not a chunk (i.e., a complete document), the value should be 0.</para>
- </listitem>
- </varlistentry>
- <varlistentry>
- <term>documentId</term>
- <listitem><para>[Number] current document id. Chunks belonging to the same
- document have identical document id.</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>isCompleted</term>
- <listitem><para>[Number] 1 if the chunk is the last in a sequence, 0
- otherwise.</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>url</term>
- <listitem><para>[String] document url.</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>throttleID</term>
- <listitem><para>[String] special attribute currently used by
- OmniFind.</para></listitem>
- </varlistentry>
- </variablelist></para>
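-
- <para>The features listed above could be declared in a type system
- descriptor along the following lines. This is only a sketch: the supertype
- <literal>uima.cas.TOP</literal> and the range types (Integer for the
- features marked [Number], String for those marked [String]) are assumptions
- inferred from the list, not a definition taken from the UIMA SDK.
-
-
- <programlisting><![CDATA[<typeDescription>
-  <name>org.apache.es.tt.DocumentMetaData</name>
-  <description>Chunk metadata read by the SequencedQueue (sketch)</description>
-  <supertypeName>uima.cas.TOP</supertypeName> <!-- assumed supertype -->
-  <features>
-    <featureDescription>
-      <name>sequenceNumber</name>
-      <description>1-based chunk number; 0 for a complete document</description>
-      <rangeTypeName>uima.cas.Integer</rangeTypeName>
-    </featureDescription>
-    <featureDescription>
-      <name>documentId</name>
-      <description>identical for all chunks of one document</description>
-      <rangeTypeName>uima.cas.Integer</rangeTypeName>
-    </featureDescription>
-    <featureDescription>
-      <name>isCompleted</name>
-      <description>1 for the last chunk in a sequence, 0 otherwise</description>
-      <rangeTypeName>uima.cas.Integer</rangeTypeName>
-    </featureDescription>
-    <featureDescription>
-      <name>url</name>
-      <description>document url</description>
-      <rangeTypeName>uima.cas.String</rangeTypeName>
-    </featureDescription>
-    <featureDescription>
-      <name>throttleID</name>
-      <description>special attribute</description>
-      <rangeTypeName>uima.cas.String</rangeTypeName>
-    </featureDescription>
-  </features>
-</typeDescription>]]></programlisting></para>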
-
- <para>This implementation of a sequenced queue supports proper sequencing of CASes in
- CPM deployments that use document chunking. Chunking is a technique of splitting
- large documents into pieces to reduce overall memory consumption. Chunking does not
- depend on the number of CASes in the CAS Pool. It works equally well with one or more
- CASes in the CAS Pool. Each chunk is packaged in a separate CAS and placed in the Work
- Queue. If the CAS Pool is depleted, the CollectionReader thread is suspended until a
- CAS is released back to the pool by the processing threads. A document may be split into
- 1, 2, 3 or more chunks that are analyzed independently. In order to reconstruct the
- document correctly, the CAS Consumer can depend on receiving the chunks in the same
- sequential order that the chunks were <quote>produced</quote>, when this
- sequenced queue implementation is used. To plug in this sequenced queue to the CPM use
- the following specification:
-
-
- <programlisting><outputQueue dequeueTimeout="100000" queueClass=
-"org.apache.uima.collection.impl.cpm.engine.SequencedQueue"/></programlisting>
-
- where the mandatory <literal>queueClass</literal> attribute defines the name of
- the class, and the second mandatory attribute, <literal>dequeueTimeout</literal>,
- specifies the maximum number of milliseconds to wait for the expected chunk.</para>
-
- <note><para>The value for this timeout must be carefully determined to avoid
- excessive occurrences of timeouts. Typically, the size of a chunk and the type of
- analysis being done are the most important factors when deciding on the value for the
- timeout. The larger the chunk and the more complicated the analysis, the more time it takes
- for the chunk to go from source to sink. You may specify 0, in which case, the timeout is
- disabled - i.e., it is equivalent to an infinitely long timeout.</para></note>
-
- <para>If the chunk doesn't arrive in the configured time window, the entire
- document is presumed to be invalid and the CAS is dropped from further processing.
- This action occurs regardless of any other error action specification. The
- SequencedQueue invalidates the document by adding the offending document's
- metadata to a local cache of invalid documents.</para>
-
- <para>If the timeout occurs, the CPM notifies all registered listeners (see <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.cpe.using_listeners"/>) by calling
- entityProcessComplete(). As part of this call, the SequencedQueue will pass null
- instead of a CAS as the first argument, and a special exception –
- CPMChunkTimeoutException. Null is passed as the first argument
- because the expected chunk was not received within the
- configured timeout window, so there is no CAS available when the timeout event
- occurs.</para>
-
- <para>The CPMChunkTimeoutException object includes an API that allows the listener
- to retrieve the offending document id as well as the other metadata attributes as
- defined above. These attributes are part of each chunk's metadata and are added
- by the Collection Reader.</para>
-
- <para>Each chunk that SequencedQueue works on is subjected to a test to determine if the
- chunk belongs to an invalid document. This test checks the chunk's metadata
- against the data in the local cache. If there is a match, the chunk is dropped. This
- check is performed only for chunks; complete documents are not subject to
- it.</para>
-
- <para>If there is an exception during the processing of a chunk, the CPM sends a
- notification to all registered listeners. The notification includes the CAS and an
- exception. When the listener notification is completed, the CPM also sends separate
- notifications, containing the CAS, to the Artifact Producer and the
- SequencedQueue. The intent is to stop adding new chunks to the Work Queue that belong
- to an <quote>invalid</quote> document and also to deal with chunks that are
- en-route, being processed by the processing threads.</para>
-
- <para>In response to the notification, the Artifact Producer will drop and release
- back to the CAS Pool all CASes that belong to an <quote>invalid</quote> document.
- Currently, there is no support in the CollectionReader's API to tell it to stop
- generating chunks. The CollectionReader keeps producing the chunks but the
- Artifact Producer immediately drops/releases them to the CAS Pool. Before the CAS is
- released back to the CAS Pool, the Artifact Producer sends notification to all
- registered listeners. This notification includes the CAS and an exception –
- SkipCasException.</para>
-
- <para>In response to the notification of an exception involving a chunk, the
- SequencedQueue retrieves from the CAS the metadata and adds it to its local cache of
- <quote>invalid</quote> documents. All chunks de-queued from the OutputQueue and
- belonging to <quote>invalid</quote> documents will be dropped and released back to
- the CAS Pool. Before dropping the CAS, the CPM sends notification to all registered
- listeners. The notification includes the CAS and SkipCasException.</para>
-
- <para>The <literal><checkpoint></literal> element is an optional element.
- It specifies a CPE checkpoint file, checkpoint frequency, and strategy for
- checkpoints (time or count based). At checkpoint time, the CPM saves status
- information and statistics to the checkpoint file. The checkpoint file is specified
- in the <literal>file</literal> attribute, which has the same form as the
- <literal>href</literal> attribute of the <literal><include></literal>
- element described in <xref linkend="&tp;imports"/>. The
- <literal>time</literal> attribute indicates that a checkpoint should be taken
- every <literal>[Number]</literal> seconds, and the <literal>batch</literal>
- attribute indicates that a checkpoint should be taken every
- <literal>[Number]</literal> batches.</para>
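-
- <para>For instance, a time-based checkpoint could be configured as in the
- following sketch. The file name <literal>cpe_checkpoint.dat</literal> and
- the interval of 300 seconds are hypothetical values chosen for illustration.
-
-
- <programlisting><![CDATA[<cpeConfig>
-  ...
-  <!-- save status and statistics to the file every 300 seconds -->
-  <checkpoint file="cpe_checkpoint.dat" time="300"/>
-  ...
-</cpeConfig>]]></programlisting></para>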
-
- <para>The <literal><timerImpl></literal> element is optional. It is used to
- identify a custom timer plug-in class to generate time stamps during the CPM
- execution. The value of the element is a Java class name.</para>
-
- <para>The <literal><deployAs></literal> element indicates the type of CPM
- deployment. Valid contents for this element include:
-
- <variablelist>
- <varlistentry>
- <term>vinciService</term>
- <listitem><para>Vinci service exposing APIs for stop, pause, resume, and
- getStats</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>interactive</term>
- <listitem><para>provide command line menus (start, stop, pause,
- resume)</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>immediate</term>
- <listitem><para>run the CPM without menus or a service API</para></listitem>
- </varlistentry>
- <varlistentry>
- <term>single-threaded</term>
- <listitem><para>run the CPM in a single threaded mode. In this mode, the
- Collection Reader, the Processing Pipeline, and the CAS Consumer Pipeline
- are all running in one thread without the work queue and the output
- queue.</para></listitem>
- </varlistentry>
- </variablelist></para>
-
- </section>
-
- <section id="&tp;descriptor.resource_manager_configuration">
- <title>Resource Manager Configuration</title>
-
- <para>External resource bindings for the CPE may optionally be specified in a
- <literal><resourceManagerConfiguration></literal> element:
-
-
- <programlisting><resourceManagerConfiguration href="..."/></programlisting></para>
-
- <para>For an introduction to external resources, refer to <olink
- targetdoc="&uima_docs_tutorial_guides;"
- targetptr="ugr.tug.aae.accessing_external_resource_files"/>.</para>
-
- <para>In the <literal>resourceManagerConfiguration</literal> element, the value
- of the href attribute refers to another file that contains definitions and bindings
- for the external resources used by the CPE. The format of this file is the same as the XML
- snippet described in <olink targetdoc="&uima_docs_ref;"
- targetptr="ugr.ref.xml.component_descriptor.aes.aggregate.external_resource_bindings"/>.
- For example, in a CPE containing an aggregate analysis engine with two annotators,
- and a CAS Consumer, the following resource manager configuration file would bind
- external resource dependencies in all three components to the same physical
- resource:
-
-
- <programlisting><![CDATA[<resourceManagerConfiguration>
-
- <!-- Declare Resource -->
-
- <externalResources>
- <externalResource>
- <name>ExampleResource</name>
- <fileResourceSpecifier>
- <fileUrl>file:MyResourceFile.dat</fileUrl>
- </fileResourceSpecifier>
- </externalResource>
- </externalResources>
-
- <!-- Bind component resource dependencies to ExampleResource -->
-
- <externalResourceBindings>
- <externalResourceBinding>
- <key>MyAE/annotator1/myResourceKey</key>
- <resourceName>ExampleResource</resourceName>
- </externalResourceBinding>
-
- <externalResourceBinding>
- <key>MyAE/annotator2/someResourceKey</key>
- <resourceName>ExampleResource</resourceName>
- </externalResourceBinding>
-
- <externalResourceBinding>
- <key>MyCasConsumer/otherResourceKey</key>
- <resourceName>ExampleResource</resourceName>
- </externalResourceBinding>
-
- </externalResourceBindings>
-
-</resourceManagerConfiguration>]]></programlisting></para>
-
- <para>In this example, <literal>MyAE</literal> and
- <literal>MyCasConsumer</literal> are the names of the Analysis Engine and CAS
- Consumer, as specified by the name attributes of the CPE's
- <literal><casProcessor></literal> elements.
- <literal>annotator1</literal> and <literal>annotator2</literal> are the
- annotator keys specified within the Aggregate AE Descriptor, and
- <literal>myResourceKey</literal>, <literal>someResourceKey</literal>, and
- <literal>otherResourceKey</literal> are the keys of the resource dependencies
- declared in the individual annotator and CAS Consumer descriptors.</para>
-
- </section>
-
- <section id="&tp;descriptor.example">
- <title>Example CPE Descriptor</title>
-
-
- <programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
-<cpeDescription>
- <collectionReader>
- <collectionIterator>
- <descriptor>
- <import location=
- "../collection_reader/FileSystemCollectionReader.xml"/>
- </descriptor>
- </collectionIterator>
- </collectionReader>
- <casProcessors dropCasOnException="true" casPoolSize="1"
- processingUnitThreadCount="1">
- <casProcessor deployment="integrated"
- name="Aggregate TAE - Name Recognizer and Person Title Annotator">
- <descriptor>
- <import location=
- "../analysis_engine/NamesAndPersonTitles_TAE.xml"/>
- </descriptor>
- <deploymentParameters/>
- <filter/>
- <errorHandling>
- <errorRateThreshold action="terminate" value="100/1000"/>
- <maxConsecutiveRestarts action="terminate" value="30"/>
- <timeout max="100000"/>
- </errorHandling>
- <checkpoint batch="1"/>
- </casProcessor>
- <casProcessor deployment="integrated" name="Annotation Printer">
- <descriptor>
- <import location="../cas_consumer/AnnotationPrinter.xml"/>
- </descriptor>
- <deploymentParameters/>
- <filter/>
- <errorHandling>
- <errorRateThreshold action="terminate" value="100/1000"/>
- <maxConsecutiveRestarts action="terminate" value="30"/>
- <timeout max="100000"/>
- </errorHandling>
- <checkpoint batch="1"/>
- </casProcessor>
- </casProcessors>
- <cpeConfig>
- <numToProcess>1</numToProcess>
- <deployAs>immediate</deployAs>
- <checkpoint file="" time="3000"/>
- <timerImpl/>
- </cpeConfig>
-</cpeDescription>]]></programlisting>
- </section>
-
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.5//EN"
+"http://www.oasis-open.org/docbook/xml/4.5/docbookx.dtd"[
+<!ENTITY imgroot "../images/references/ref.xml.cpe_descriptor/">
+<!ENTITY tp "ugr.ref.xml.cpe_descriptor.">
+<!ENTITY % uimaents SYSTEM "../entities.ent" >
+%uimaents;
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<chapter id="ugr.ref.xml.cpe_descriptor">
+ <title>Collection Processing Engine Descriptor Reference</title>
+ <titleabbrev>CPE Descriptor Reference</titleabbrev>
+
+ <para>A UIMA <emphasis>Collection Processing Engine</emphasis> (CPE) is a combination
+ of UIMA components assembled to analyze a collection of artifacts. A CPE is an
+ instantiation of the UIMA <emphasis>Collection Processing Architecture</emphasis>,
+ which defines the collection processing components, interfaces, and APIs. A CPE is
+ executed by a UIMA framework component called the <emphasis>Collection Processing
+ Manager</emphasis> (CPM), which provides a number of services for deploying CPEs,
+ running CPEs, and handling errors.</para>
+
+ <para>A CPE can be assembled programmatically within a Java application, or it can be
+ assembled declaratively via a CPE configuration specification, called a CPE
+ Descriptor. This chapter describes the format of the CPE Descriptor.</para>
+
+ <para>Details about the CPE, including its function, sub-components, APIs, and related
+ tools, can be found in <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.cpe"/>. Here we briefly summarize the CPE to define terms and
+ provide context for the later sections that describe the CPE Descriptor.</para>
+
+ <section id="&tp;overview">
+ <title>CPE Overview</title>
+
+ <figure id="&tp;overview.fig.runtime">
+ <title>CPE Runtime Overview</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.8in" format="PNG"
+ fileref="&imgroot;image002.png"/>
+ </imageobject>
+ <textobject><phrase>CPE Runtime Overview diagram</phrase></textobject>
+ </mediaobject>
+ </figure>
+
+ <para>An illustration of the CPE runtime is shown in <xref
+ linkend="&tp;overview.fig.runtime"/>. Some of the CPE components, such as the
+ <emphasis>queues</emphasis> and <emphasis>processing pipelines</emphasis>, are
+ internal to the CPE, but their behavior and deployment may be configured using the CPE
+ Descriptor. Other CPE components, such as the <emphasis>Collection
+ Reader</emphasis> and <emphasis>CAS Processors</emphasis>, are defined and
+ configured externally from the CPE and then plugged in to the CPE to create the overall
+ engine. The parts of a CPE are:
+
+ <variablelist>
+ <varlistentry>
+ <term>Collection Reader</term>
+ <listitem><para>understands the native data collection format and iterates
+ over the collection producing subjects of analysis</para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>CAS Initializer<footnote><para>Deprecated</para></footnote>
+ </term>
+ <listitem><para>initializes a CAS with a subject of analysis</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Artifact Producer</term>
+ <listitem><para>asynchronously pulls CASes from the Collection Reader,
+ creates batches of CASes and puts them into the work queue</para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Work Queue</term>
+ <listitem><para>shared queue containing batches of CASes queued by the Artifact
+ Producer for analysis by Analysis Engines</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>B1-Bn</term>
+ <listitem><para>individual batches containing 1 or more CASes</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>AE1-AEn</term>
+ <listitem><para>Analysis Engines arranged by a CPE descriptor</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Processing Pipelines</term>
+ <listitem><para>each pipeline runs in a separate thread and contains a
+ replicated set of the Analysis Engines running in the defined sequence</para>
+ </listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>Output Queue</term>
+ <listitem><para>holds batches of CASes with analysis results intended for CAS
+ Consumers</para></listitem>
+ </varlistentry>
+
+ <varlistentry>
+ <term>CAS Consumers</term>
+ <listitem><para>perform collection-level analysis over the CASes and extract
+ analysis results, e.g., creating indexes or databases</para></listitem>
+ </varlistentry>
+ </variablelist>
+ </para>
+ </section>
+
+ <section id="&tp;notation">
+ <title>Notation</title>
+
+ <para>CPE Descriptors are XML files. This chapter uses an informal notation to specify
+ the syntax of CPE Descriptors.</para>
+
+ <para>The notation used in this chapter is:
+
+ <itemizedlist><listitem><para>An ellipsis (...) inside an element body indicates
+ that the substructure of that element has been omitted (to be described in another
+ section of this chapter). An example of this would be:
+
+
+ <programlisting>&lt;collectionReader&gt;
+...
+&lt;/collectionReader&gt;</programlisting></para>
+ </listitem>
+
+ <listitem><para>An ellipsis immediately after an element indicates that the
+ element type may be repeated arbitrarily many times. For example:
+
+
+ <programlisting>&lt;parameter&gt;[String]&lt;/parameter&gt;
+&lt;parameter&gt;[String]&lt;/parameter&gt;
+...</programlisting>
+ indicates that there may be arbitrarily many parameter elements in this
+ context.</para></listitem>
+
+ <listitem><para>An ellipsis inside an element means details of the attributes
+ associated with that element are defined later, e.g.:
+
+ <programlisting>&lt;casProcessor ...&gt;</programlisting></para>
+ </listitem>
+
+ <listitem><para>Bracketed expressions (e.g. <literal>[String]</literal>)
+ indicate the type of value that may be used at that location.</para></listitem>
+
+ <listitem><para>A vertical bar, as in <literal>true|false</literal>, indicates
+ alternatives. This can be applied to literal values, bracketed type names, and
+ elements. </para></listitem></itemizedlist></para>
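+
+ <para>As an informal illustration, the conventions above might combine as in the
+ following fragment. The element names <literal>exampleElement</literal>,
+ <literal>flag</literal>, <literal>item</literal>, and <literal>nested</literal> are
+ hypothetical and are not part of the CPE Descriptor syntax:
+
+ <programlisting><![CDATA[<exampleElement ...>
+  <flag>true|false</flag>
+  <item>[String]</item>
+  <item>[String]</item>
+  ...
+  <nested>
+  ...
+  </nested>
+</exampleElement>]]></programlisting>
+ Read with the rules above, this would describe an element whose attributes are defined
+ elsewhere, a Boolean-valued child, any number of string-valued
+ <literal>item</literal> children, and a child whose substructure is described in
+ another section.</para>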
+
+ <para>Which elements are optional and which are required is specified in prose, not in the
+ syntax definition.</para>
+
+ </section>
+
+ <section id="&tp;imports">
+ <title>Imports</title>
+
+ <para>As of version 2.2, a CPE Descriptor can use the same <literal>import</literal> mechanism
+ as other component descriptors. This allows referring to component
+ descriptors using either relative paths (resolved relative to the location of the CPE descriptor)
+ or the classpath/datapath. For details see <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor"/>.</para>
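+
+ <para>As a brief sketch, the two forms of <literal>import</literal> might look like the
+ following inside a <literal>&lt;descriptor&gt;</literal> element. The path and the
+ dotted name are hypothetical placeholders, not files shipped with UIMA, and a given
+ <literal>&lt;descriptor&gt;</literal> would contain only one
+ <literal>import</literal>:
+
+ <programlisting><![CDATA[<!-- by location: a relative path, resolved against the CPE descriptor's location -->
+<descriptor>
+  <import location="../descriptors/MyAnnotator.xml"/>
+</descriptor>
+
+<!-- by name: looked up on the classpath/datapath (hypothetical name) -->
+<descriptor>
+  <import name="org.example.descriptors.MyAnnotator"/>
+</descriptor>]]></programlisting></para>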
+
+ <para>The following older syntax is still supported, but <emphasis>not recommended</emphasis>:
+
+ <programlisting><![CDATA[<descriptor>
+ <include href="[URL or File]"/>
+</descriptor>]]></programlisting></para>
+
+ <para>The <literal>[URL or File]</literal> value of the <literal>href</literal> attribute is a URL
+ or a filename for the descriptor of the incorporated component. The framework first attempts to
+ resolve this value as a URL.</para>
+
+ <para>
+ Relative paths in an <literal>include</literal> are resolved relative to the current working directory
+ (NOT the CPE descriptor location as is the case for <literal>import</literal>).
+ A filename relative to another directory can be specified using the <literal>CPM_HOME</literal>
+ variable, e.g.,
+ <programlisting>&lt;descriptor&gt;
+ &lt;include href="${CPM_HOME}/desc_dir/descriptor.xml"/&gt;
+&lt;/descriptor&gt;</programlisting>
+
+ In this case, the value for the <literal>CPM_HOME</literal> variable must be
+ provided to the CPE by specifying it on the Java command line, e.g.,
+
+ <programlisting>java -DCPM_HOME="C:/Program Files/apache/uima/cpm" ...</programlisting>
+
+ </para>
+
+ </section>
+
+ <section id="&tp;descriptor">
+ <title>CPE Descriptor Overview</title>
+
+ <para>A CPE Descriptor consists of information describing the following four main
+ elements.</para>
+
+ <orderedlist><listitem><para>The <emphasis>Collection Reader</emphasis>, which
+ is responsible for gathering artifacts and initializing the Common Analysis
+ Structure (CAS) used to support processing in the UIMA collection processing
+ engine.</para></listitem>
+
+ <listitem><para>The <emphasis>CAS Processors</emphasis>, responsible for
+ analyzing individual artifacts, analyzing across artifacts, and extracting
+ analysis results. CAS Processors include <emphasis>Analysis Engines</emphasis>
+ and <emphasis>CAS Consumers</emphasis>.</para></listitem>
+
+ <listitem><para>Operational parameters of the <emphasis>Collection Processing
+ Manager</emphasis> (CPM), such as checkpoint frequency and deployment
+ mode.</para></listitem>
+
+ <listitem><para>Resource Manager Configuration (optional). </para></listitem>
+ </orderedlist>
+
+ <para>The CPE Descriptor has the following high-level skeleton:
+
+
+ <programlisting><![CDATA[<?xml version="1.0"?>
+<cpeDescription>
+ <collectionReader>
+...
+ </collectionReader>
+ <casProcessors>
+...
+ </casProcessors>
+ <cpeConfig>
+...
+ </cpeConfig>
+ <resourceManagerConfiguration>
+...
+ </resourceManagerConfiguration>
+</cpeDescription>]]></programlisting></para>
+
+ <para>Details of each of the four main elements are described in the sections that
+ follow.</para>
+ </section>
+ <section id="&tp;descriptor.collection_reader">
+ <title>Collection Reader</title>
+
+ <para>The <literal>&lt;collectionReader&gt;</literal> section identifies the
+ Collection Reader and optional CAS Initializer that are to be used in the CPE. The
+ Collection Reader is responsible for retrieval of artifacts from a collection
+ outside of the CPE, and the optional CAS Initializer (deprecated as of UIMA Version 2)
+ is responsible for initializing the CAS with the artifact.</para>
+
+ <para>A Collection Reader may initialize the CAS itself, in which case it does not
+ require a CAS Initializer. This should be clearly specified in the documentation for
+ the Collection Reader. Specifying a CAS Initializer for a Collection Reader that
+ does not make use of a CAS Initializer will not cause an error, but the specified CAS
+ Initializer will not be used.</para>
+
+ <para>The complete structure of the <literal>&lt;collectionReader&gt;</literal>
+ section is:
+
+
+ <programlisting><![CDATA[<collectionReader>
+  <collectionIterator>
+    <descriptor>
+      <import ...> | <include .../>
+    </descriptor>
+    <configurationParameterSettings>...</configurationParameterSettings>
+    <sofaNameMappings>...</sofaNameMappings>
+  </collectionIterator>
+  <casInitializer>
+    <descriptor>
+      <import ...> | <include .../>
+    </descriptor>
+    <configurationParameterSettings>...</configurationParameterSettings>
+    <sofaNameMappings>...</sofaNameMappings>
+  </casInitializer>
+</collectionReader>]]></programlisting></para>
+
+ <para>The <literal>&lt;collectionIterator&gt;</literal> element identifies the
+ descriptor for the Collection Reader, and the <literal>&lt;casInitializer&gt;</literal>
+ element identifies the descriptor for the CAS Initializer. The format and
+ details of the Collection Reader and CAS Initializer descriptors are described in
+ <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.component_descriptor.collection_processing_parts.collection_reader"/>.
+ The <literal>&lt;configurationParameterSettings&gt;</literal> and the
+ <literal>&lt;sofaNameMappings&gt;</literal> elements are described in the next
+ section.</para>
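+
+ <para>For orientation, a minimal <literal>&lt;collectionReader&gt;</literal> section
+ might look like the sketch below. It assumes a Collection Reader that initializes the CAS
+ itself, so no <literal>&lt;casInitializer&gt;</literal> is given, and it assumes that
+ the parameter-setting and Sofa-mapping elements can be left out when no overrides are
+ needed. The descriptor path is a hypothetical placeholder:
+
+ <programlisting><![CDATA[<collectionReader>
+  <collectionIterator>
+    <descriptor>
+      <!-- hypothetical path, resolved relative to this CPE descriptor -->
+      <import location="../collection_reader/MyCollectionReader.xml"/>
+    </descriptor>
+  </collectionIterator>
+</collectionReader>]]></programlisting></para>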
+
+ <section id="&tp;descriptor.collection_reader.error_handling">
+ <title>Error handling for Collection Readers</title>
+
+ <para>The CPM will abort if the Collection Reader throws a large number of
+ consecutive exceptions (default = 100). This default can be changed by using the
+ Java initialization parameter <literal>-DMaxCRErrorThreshold
+ xxx</literal>.</para>
+ </section>
+ </section>
+
+ <section id="&tp;descriptor.cas_processors">
+ <title>CAS Processors</title>
+
[... 1053 lines stripped ...]