Posted to commits@uima.apache.org by sc...@apache.org on 2010/05/06 16:06:04 UTC
svn commit: r941744 [6/7] - in
/uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides: ./
src/ src/docbook/ src/docbook/images/
src/docbook/images/tutorials_and_users_guides/
src/docbook/images/tutorials_and_users_guides/tug.aae/ src/d...
Added: uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides/src/docbook/tug.cpe.xml
URL: http://svn.apache.org/viewvc/uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides/src/docbook/tug.cpe.xml?rev=941744&view=auto
==============================================================================
--- uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides/src/docbook/tug.cpe.xml (added)
+++ uima/uimaj/branches/mavenAlign/uima-docbook-tutorials-and-users-guides/src/docbook/tug.cpe.xml Thu May 6 14:06:02 2010
@@ -0,0 +1,1333 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+<!ENTITY imgroot "images/tutorials_and_users_guides/tug.cpe/">
+<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent">
+%uimaents;
+]>
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<chapter id="ugr.tug.cpe">
+ <title>Collection Processing Engine Developer's Guide</title>
+ <titleabbrev>CPE Developer's Guide</titleabbrev>
+
+ <para>The UIMA Analysis Engine interface provides support for developing and integrating
+ algorithms that analyze unstructured data. Analysis Engines are designed to operate on a
+ per-document basis. Their interface handles one CAS at a time. UIMA provides additional
+ support for applying analysis engines to collections of unstructured data with its
+ <emphasis>Collection Processing Architecture</emphasis>. The Collection
+ Processing Architecture defines additional components for reading raw data formats
+ from data collections, preparing the data for processing by Analysis Engines, executing
+ the analysis, extracting analysis results, and deploying the overall flow in a variety of
+ local and distributed configurations.</para>
+
+ <para>The functionality defined in the Collection Processing Architecture is
+ implemented by a <emphasis>Collection Processing Engine</emphasis> (CPE). A CPE
+ includes an Analysis Engine and adds a <emphasis>Collection Reader</emphasis>, a
+ <emphasis>CAS Initializer</emphasis> (deprecated as of version 2), and <emphasis>CAS
+ Consumers</emphasis>. The part of the UIMA Framework that supports the execution of
+ CPEs is called the Collection Processing Manager, or CPM.</para>
+
+ <para>A Collection Reader provides the interface to the raw input data and knows how to
+ iterate over the data collection. Collection Readers are discussed in <xref
+ linkend="ugr.tug.cpe.collection_reader.developing"/>. The CAS Initializer
+ <footnote><para>CAS Initializers are deprecated in favor of a more general mechanism,
+ multiple subjects of analysis.</para></footnote> prepares an individual data item for
+ analysis and loads it into the CAS. CAS Initializers are discussed in <xref
+ linkend="ugr.tug.cpe.cas_initializer.developing"/>. A CAS Consumer extracts
+ analysis results from the CAS and may also perform <emphasis>collection level
+ processing</emphasis>, or analysis over a collection of CASes. CAS Consumers are
+ discussed in <xref linkend="ugr.tug.cpe.cas_consumer.developing"/>.</para>
+
+ <para>Analysis Engines and CAS Consumers are both instances of <emphasis>CAS
+ Processors</emphasis>. A Collection Processing Engine (CPE) may contain multiple CAS
+ Processors. An Analysis Engine contained in a CPE may itself be a Primitive or an Aggregate
+ (composed of other Analysis Engines). Aggregates may contain CAS Consumers. While
+ Collection Readers and CAS Initializers always run in the same JVM as the CPM, a CAS
+ Processor may be deployed in a variety of local and distributed modes, providing a number
+ of options for scalability and robustness. The different deployment options are covered
+ in detail in <xref linkend="ugr.tug.cpe.deployment_alternatives"/>.</para>
+
+ <para>Each of the components in a CPE has an interface specified by the UIMA Collection
+ Processing Architecture and is described by a declarative XML descriptor file.
+ Similarly, the CPE itself has a well defined component interface and is described by a
+ declarative XML descriptor file.</para>
+
+ <para>A user creates a CPE by assembling the components mentioned above. The UIMA SDK
+ provides a graphical tool, called the CPE Configurator, for assisting in the assembly of
+ CPEs. Use of this tool is summarized in <xref
+ linkend="ugr.tug.cpe.cpe_configurator"/>, and more details can be found in <olink
+ targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cpe"/>.
+ Alternatively, a CPE can be assembled by writing an XML CPE descriptor. Details on the CPE
+ descriptor, including its syntax and content, can be found in the <olink
+ targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>. The individual
+ components have associated XML descriptors, each of which can be created and/or edited
+ using the <olink targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cde">
+ Component Description Editor</olink>.</para>
+
+ <para>A CPE is executed by a UIMA infrastructure component called the
+ <emphasis>Collection Processing Manager</emphasis> (CPM). The CPM provides a number
+ of services and deployment options that cover instantiation and execution of CPEs, error
+ recovery, and local and distributed deployment of the CPE components.</para>
+
+ <section id="ugr.tug.cpe.concepts">
+ <title>CPE Concepts</title>
+
+ <para> <xref linkend="ugr.tug.cpe.fig.cpe_components"/> illustrates the data flow
+ that occurs between the different types of components that make up a CPE.</para>
+
+ <figure id="ugr.tug.cpe.fig.cpe_components">
+ <title>CPE Components</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="PNG"
+ fileref="&imgroot;image002.png"/>
+ </imageobject>
+ <textobject><phrase>CPE Components and flow between them</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+ <para>The components of a CPE are:</para>
+
+ <itemizedlist><listitem><para><emphasis>Collection Reader –</emphasis>
+ interfaces to a collection of data items (e.g., documents) to be analyzed. Collection
+ Readers return CASes that contain the documents to analyze, possibly along with
+ additional metadata.</para></listitem>
+
+ <listitem><para><emphasis>Analysis Engine –</emphasis> takes a CAS,
+ analyzes its contents, and produces an enriched CAS. Analysis Engines can be
+ recursively composed of other Analysis Engines (called an
+ <emphasis>Aggregate</emphasis> Analysis Engine). Aggregates may also contain
+ CAS Consumers.</para></listitem>
+
+ <listitem><para><emphasis>CAS Consumer –</emphasis> consumes the enriched
+ CAS that was produced by the sequence of Analysis Engines before it, and produces an
+ application-specific data structure, such as a search engine index or database.
+ </para></listitem></itemizedlist>
+
+ <para>A fourth type of component, the <emphasis>CAS Initializer,</emphasis> may be
+ used by a Collection Reader to populate a CAS from a document. However, as of UIMA
+ version 2, CAS Initializers are deprecated in favor of a more general mechanism,
+ multiple Subjects of Analysis.</para>
+
+ <para>The Collection Processing Manager orchestrates the data flow
+ within a CPE, monitors status, optionally manages the life-cycle of internal
+ components and collects statistics.</para>
+
+ <para>The framework does not persist CASes. To save them, you must write out each CAS as
+ it passes through the CPE, for example with a CAS Consumer that you write for this purpose,
+ in whatever format you like. The UIMA SDK supplies an example CAS
+ Consumer that saves CASes to XML files, either in the standard XMI format or in an older
+ format called XCAS. It also supplies an example CAS Consumer that extracts information from CASes and
+ stores the results in a relational database, using Java's JDBC APIs.</para>
+
+ </section>
+
+ <section id="ugr.tug.cpe.configurator_and_viewer">
+ <title>CPE Configurator and CAS viewer</title>
+
+ <section id="ugr.tug.cpe.cpe_configurator">
+ <title>Using the CPE Configurator</title>
+
+ <para>A CPE can be assembled by writing an XML CPE descriptor. Details on the CPE
+ descriptor, including its syntax and content, can be found in <olink
+ targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>. Rather than
+ edit raw XML, you may develop a CPE Descriptor using the CPE Configurator tool. The CPE
+ Configurator tool is described briefly in this section, and in more detail in <olink
+ targetdoc="&uima_docs_tools;" targetptr="ugr.tools.cpe"/>.</para>
+
+ <para>The CPE Configurator tool can be run from Eclipse (see <xref
+ linkend="ugr.tug.cpe.running_cpe_configurator_from_eclipse"/>), or using
+ the <literal>cpeGui</literal> shell script (<literal>cpeGui.bat</literal> on
+ Windows, <literal>cpeGui.sh</literal> on Unix), which is located in the
+ <literal>bin</literal> directory of the UIMA SDK installation. Executing this
+ batch file will display the window shown here:
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image004.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of CPE GUI</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+
+ <para>The window is divided into three sections, one each for the Collection Reader,
+ Analysis Engines, and CAS Consumers.<footnote><para>There is also a fourth pane,
+ for the CAS Initializer, but it is hidden by default. To enable it click the
+ <literal>View → CAS Initializer Panel</literal> menu item.</para></footnote>
+ In each section, you select the component(s) you want to include in the CPE by
+ browsing to their XML descriptors. The configuration parameters present in the XML
+ descriptors will then be displayed in the GUI; these can be modified to override
+ the values present in the descriptor. For example, the screenshot below shows the
+ CPE Configurator after the following components have been chosen:
+
+
+ <programlisting>Collection Reader:
+ %UIMA_HOME%/examples/descriptors/collection_reader/
+ FileSystemCollectionReader.xml
+
+Analysis Engine:
+ %UIMA_HOME%/examples/descriptors/analysis_engine/
+ NamesAndPersonTitles_TAE.xml
+
+CAS Consumer:
+ %UIMA_HOME%/examples/descriptors/cas_consumer/
+ XmiWriterCasConsumer.xml</programlisting></para>
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image006.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of CPE GUI after fields filled in</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>For the File System Collection Reader, ensure that the Input Directory is set to
+ <literal>%UIMA_HOME%\examples\data</literal><footnote><para>Replace
+ <literal>%UIMA_HOME%</literal> with the path to where you installed UIMA.</para>
+ </footnote>. The other parameters may be left blank. For the XMI Writer CAS
+ Consumer, ensure that the Output Directory is set to
+ <literal>%UIMA_HOME%\examples\data\processed</literal>.</para>
+
+ <para>After selecting each of the components and providing configuration settings,
+ click the play (forward arrow) button at the bottom of the screen to begin processing.
+ A progress bar should be displayed in the lower left corner. (Note that the progress
+ bar will not begin to move until all components have completed their initialization,
+ which may take several seconds.) Once processing has begun, the pause and stop
+ buttons become enabled.</para>
+
+ <para>If an error occurs, you will be informed by an error dialog. If processing
+ completes successfully, you will be presented with a performance report.</para>
+
+ <para>Using the File menu, you can select <literal>Save CPE Descriptor</literal> to
+ create an .xml descriptor file that defines the CPE you have constructed. Later, you
+ can use <literal>Open CPE Descriptor</literal> to restore the CPE Configurator to
+ the saved state. Also, CPE descriptors can be used to run a CPE from a Java program
+ – see section <xref
+ linkend="ugr.tug.cpe.running_cpe_from_application"/>. CPE Descriptors
+ allow specifying operational parameters, such as error handling options, that are
+ not currently available for configuration through the CPE Configurator. For more
+ information on manually creating a CPE Descriptor, see the <olink
+ targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.</para>
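+
+ <para>To make the operational parameters mentioned above more concrete, the following is a
+ minimal sketch of what a CPE descriptor for the three components selected earlier might
+ look like. It is an illustration under assumptions, not the exact file that the CPE
+ Configurator writes: the relative <literal>import</literal> locations, the processor
+ names, the <literal>/path/to/uima</literal> directory, and all numeric values are
+ hypothetical, and the parameter name <literal>InputDirectory</literal> is assumed to match
+ the name declared in the File System Collection Reader's descriptor. The
+ <literal>errorHandling</literal> elements show the kind of setting that has to be edited
+ in the XML itself. The authoritative syntax is given in <olink
+ targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>.</para>
+
+ <programlisting><![CDATA[<?xml version="1.0" encoding="UTF-8"?>
+<cpeDescription xmlns="http://uima.apache.org/resourceSpecifier">
+  <collectionReader>
+    <collectionIterator>
+      <descriptor>
+        <!-- hypothetical location, relative to this CPE descriptor -->
+        <import location="../collection_reader/FileSystemCollectionReader.xml"/>
+      </descriptor>
+      <!-- overrides a value from the component descriptor;
+           the path is a placeholder -->
+      <configurationParameterSettings>
+        <nameValuePair>
+          <name>InputDirectory</name>
+          <value><string>/path/to/uima/examples/data</string></value>
+        </nameValuePair>
+      </configurationParameterSettings>
+    </collectionIterator>
+  </collectionReader>
+  <casProcessors casPoolSize="3" processingUnitThreadCount="1">
+    <casProcessor deployment="integrated" name="Names and Person Titles">
+      <descriptor>
+        <import location="../analysis_engine/NamesAndPersonTitles_TAE.xml"/>
+      </descriptor>
+      <deploymentParameters/>
+      <!-- operational settings; the values are illustrative only -->
+      <errorHandling>
+        <errorRateThreshold action="terminate" value="100/1000"/>
+        <maxConsecutiveRestarts action="terminate" value="30"/>
+        <timeout max="100000"/>
+      </errorHandling>
+      <checkpoint batch="10000"/>
+    </casProcessor>
+    <casProcessor deployment="integrated" name="XMI Writer">
+      <descriptor>
+        <import location="../cas_consumer/XmiWriterCasConsumer.xml"/>
+      </descriptor>
+      <deploymentParameters/>
+      <errorHandling>
+        <errorRateThreshold action="terminate" value="100/1000"/>
+        <maxConsecutiveRestarts action="terminate" value="30"/>
+        <timeout max="100000"/>
+      </errorHandling>
+      <checkpoint batch="10000"/>
+    </casProcessor>
+  </casProcessors>
+  <cpeConfig>
+    <numToProcess>-1</numToProcess>
+    <deployAs>immediate</deployAs>
+    <checkpoint file="" time="300000"/>
+    <timerImpl/>
+  </cpeConfig>
+</cpeDescription>]]></programlisting>
+
+ <para>A descriptor along these lines could be loaded with <literal>Open CPE
+ Descriptor</literal> or passed to a Java application as described in <xref
+ linkend="ugr.tug.cpe.running_cpe_from_application"/>.</para>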
+
+ <para>The CPE configured above runs a simple name and title annotator on the sample data
+ provided with the UIMA SDK and stores the results using the XMI Writer CAS Consumer. To
+ view the results, start the External CAS Annotation Viewer by running the
+ <literal>annotationViewer</literal> batch file
+ (<literal>annotationViewer.bat</literal> on Windows,
+ <literal>annotationViewer.sh</literal> on Unix), which is located in the
+ <literal>bin</literal> directory of the UIMA SDK installation. Executing this
+ batch file will display the window shown here:
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.5in" format="JPG" fileref="&imgroot;image008.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of Annotation Viewer results</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+
+ <para>Ensure that the Input Directory is the same as the Output Directory specified for
+ the XMI Writer CAS Consumer in the CPE configured above (e.g.,
+ <literal>%UIMA_HOME%\examples\data\processed</literal>) and that the TAE
+ Descriptor File is set to the Analysis Engine used in the CPE configured above (e.g.,
+ <literal>examples\descriptors\analysis_engine\NamesAndPersonTitles_TAE.xml</literal>).</para>
+
+ <para>Click the View button to display the Analyzed Documents window:
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="3.5in" format="JPG" fileref="&imgroot;image010.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of CPE Configurator Analyzed Documents</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+
+ <para>Double-click any document in the list to view the analyzed document. Double-clicking
+ the first document, IBM_LifeSciences.txt, will bring up the following
+ window:
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="JPG" fileref="&imgroot;image012.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot of Document and Annotation Viewer</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+
+ <para>This window shows the analysis results for the document. Clicking on any
+ highlighted annotation causes the details for that annotation to be displayed in the
+ right-hand pane. Here the annotation spanning <quote>John M. Thompson</quote> has
+ been clicked.</para>
+
+ <para>Congratulations! You have successfully configured a CPE, saved its
+ descriptor, run the CPE, and viewed the analysis results.</para>
+ </section>
+
+ <section id="ugr.tug.cpe.running_cpe_configurator_from_eclipse">
+ <title>Running the CPE Configurator from Eclipse</title>
+
+ <para>If you have followed the instructions in <olink
+ targetdoc="&uima_docs_overview;"
+ targetptr="ugr.ovv.eclipse_setup"/> and imported the example Eclipse
+ project, then you should already have a Run configuration for the CPE Configurator
+ tool (called <literal>UIMA CPE GUI</literal>) configured to run in the example
+ project. Simply run that configuration to start the CPE Configurator.</para>
+
+ <para>If you haven't followed the Eclipse setup instructions and wish to run the
+ CPE Configurator tool from Eclipse, you will need to do the following. As installed,
+ this Eclipse launch configuration is associated with the
+ <quote>uimaj-examples</quote> project. If you've not already done so, you
+ may wish to import that project into your Eclipse workspace. It's located in
+ <literal>%UIMA_HOME%/docs/examples</literal>. Doing this will supply the Eclipse launcher with all
+ the class files it needs to run the CPE configurator. If you don't do this, please
+ manually add the JAR files for UIMA to the launch configuration.</para>
+ <para>Also, you need to add any projects or JAR files for any UIMA components you will be
+ running to the launch class path.</para> <note><para>A simpler alternative may be
+ to change the CPE launch configuration to be based on your project. If you do that, it will
+ pick up all the files in your project's class path, which you should set up to
+ include all the UIMA framework files. An easy way to do this is to add the uimaj-examples
+ project to the build path in your project's properties, because the uimaj-examples
+ project is already set up to include all the UIMA framework classes in its
+ classpath.</para></note>
+
+ <para>Next, in the Eclipse menu select <literal>Run →
+ Run...</literal>, which brings up the Run configuration screen.</para>
+
+ <para>In the Main tab, set the main class to
+ <literal>org.apache.uima.tools.cpm.CpmFrame</literal>.</para>
+
+ <para>In the arguments tab, add the following to the VM arguments:
+
+
+ <programlisting>-Xms128M -Xmx256M
+-Duima.home="C:\Program Files\Apache\uima"</programlisting>
+ (or wherever you installed the UIMA SDK)</para>
+
+ <para>Click the Run button to launch the CPE Configurator, and use it as previously
+ described in this section.</para>
+
+ </section>
+ </section>
+
+ <section id="ugr.tug.cpe.running_cpe_from_application">
+ <title>Running a CPE from Your Own Java Application</title>
+
+ <para>The simplest way to run a CPE from a Java application is to first create a CPE
+ descriptor as described in the previous section. Then the CPE can be instantiated and
+ run using the following code:
+
+
+ <programlisting>// Parse the CPE descriptor in the file specified on the command line
+CpeDescription cpeDesc = UIMAFramework.getXMLParser().
+    parseCpeDescription(new XMLInputSource(args[0]));
+
+// Instantiate the CPE (mCPE is a field of type CollectionProcessingEngine)
+mCPE = UIMAFramework.produceCollectionProcessingEngine(cpeDesc);
+
+// Create and register a status callback listener
+mCPE.addStatusCallbackListener(new StatusCallbackListenerImpl());
+
+// Start processing
+mCPE.process();</programlisting></para>
+
+ <para>This will start the CPE running in a separate thread.</para>
+
+ <note><para>The <literal>process()</literal> method for a CPE can only be called once. If you
+ need to call it again, you have to instantiate a new CPE, and call that new CPE's process
+ method.</para></note>
+
+ <section id="ugr.tug.cpe.using_listeners">
+ <title>Using Listeners</title>
+
+ <para>Updates on the CPM's progress, including any errors that occur, are sent to
+ the callback handler that is registered by the call to
+ <literal>addStatusCallbackListener</literal>, above. The callback handler is a
+ class that implements the CPM's
+ <literal>StatusCallbackListener</literal> interface. The handler used in this example
+ responds to events by printing messages to the console. Its source code is fairly
+ straightforward and is not included in this chapter; for the complete code, see the class
+ <literal>org.apache.uima.examples.cpe.SimpleRunCPE</literal> in the
+ <literal>%UIMA_HOME%\examples\src</literal> directory.</para>
+
+ <para>If you need more control over the information in the CPE descriptor, you can
+ manually configure it via its API. See the Javadocs for package
+ <literal>org.apache.uima.collection</literal> for more details.</para>
+
+ </section>
+ </section>
+
+ <section id="ugr.tug.cpe.developing_collection_processing_components">
+ <title>Developing Collection Processing Components</title>
+
+ <para>This section is an introduction to the process of developing Collection Readers,
+ CAS Initializers, and CAS Consumers. The code snippets refer to classes that can be
+ found in the <literal>%UIMA_HOME%\examples\src</literal> directory of the example project.</para>
+
+ <para>In the following sections, classes you write to represent components need to be
+ public and have public, 0-argument constructors, so that they can be instantiated by
+ the framework. (Although Java classes in which you do not define any constructor will,
+ by default, have a 0-argument constructor that doesn't do anything, a class in
+ which you have defined at least one constructor does not get a default 0-argument
+ constructor.)</para>
+
+ <section id="ugr.tug.cpe.collection_reader.developing">
+ <title>Developing Collection Readers</title>
+
+ <para>A Collection Reader is responsible for obtaining documents from the collection
+ and returning each document as a CAS. Like all UIMA components, a Collection Reader
+ consists of two parts — the code and an XML descriptor.</para>
+
+ <para>A simple example of a Collection Reader is the <quote>File System Collection
+ Reader,</quote> which simply reads documents from files in a specified directory.
+ The Java code is in the class
+ <literal>org.apache.uima.examples.cpe.FileSystemCollectionReader</literal>
+ and the XML descriptor is
+ <literal>%UIMA_HOME%/examples/src/main/descriptors/collection_reader/
+ FileSystemCollectionReader.xml</literal>.</para>
+
+ <section id="ugr.tug.cpe.collection_reader.java_class">
+ <title>Java Class for the Collection Reader</title>
+
+ <para>The Java class for a Collection Reader must implement the
+ <literal>org.apache.uima.collection.CollectionReader</literal>
+ interface. You may build your Collection Reader from scratch and implement this
+ interface, or you may extend the convenience base class
+ <literal>org.apache.uima.collection.CollectionReader_ImplBase</literal>.</para>
+
+ <para>The convenience base class provides default implementations for many of the
+ methods defined in the <literal>CollectionReader</literal> interface, and
+ provides abstract definitions for those methods that you are required to
+ implement in your new Collection Reader. Note that if you extend this base class,
+ you do not need to declare that your new Collection Reader implements the
+ <literal>CollectionReader</literal> interface.</para> <tip><para>Eclipse
+ tip – if you are using Eclipse, you can quickly create the boilerplate code and
+ stubs for all of the required methods by clicking <literal>File</literal>
+ → <literal>New</literal> → <literal>Class</literal> to bring up the <quote>New Java Class</quote>
+ dialogue, specifying
+ <literal>org.apache.uima.collection.CollectionReader_ImplBase</literal>
+ as the Superclass, and checking <quote>Inherited abstract methods</quote> in the
+ section <quote>Which method stubs would you like to create?</quote>, as in the
+ screenshot below:</para></tip>
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="4.4in" format="JPG" fileref="&imgroot;image014.jpg"/>
+ </imageobject>
+ <textobject><phrase>Screenshot showing Eclipse new class wizard</phrase></textobject>
+ </mediaobject>
+ </screenshot>
+
+ <para>For the rest of this section we will assume that your new Collection Reader
+ extends the <literal>CollectionReader_ImplBase</literal> class, and we will
+ show examples from the
+ <literal>org.apache.uima.examples.cpe.FileSystemCollectionReader</literal>.
+ If you must inherit from a different superclass, you must ensure that your
+ Collection Reader implements the <literal>CollectionReader</literal>
+ interface – see the Javadocs for <literal>CollectionReader</literal>
+ for more details.</para>
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.required_methods">
+ <title>Required Methods in the Collection Reader class</title>
+
+
+ <para>The following abstract methods must be implemented:</para>
+
+ <section id="ugr.tug.cpe.collection_reader.required_methods.initialize">
+ <title>initialize()</title>
+
+ <para>The <literal>initialize()</literal> method is called by the framework
+ when the Collection Reader is first created.
+ <literal>CollectionReader_ImplBase</literal> actually provides a default
+ implementation of this method (i.e., it is not abstract), so you are not strictly
+ required to implement this method. However, a typical Collection Reader will
+ implement this method to obtain parameter values and perform various
+ initialization steps.</para>
+
+ <para>Inside this method, the Collection Reader can read the values of its
+ configuration parameters by calling <literal>getConfigParameterValue</literal>. The example
+ File System Collection Reader reads its configuration parameters and then
+ builds a list of the files in the specified input directory, as follows:</para>
+
+
+ <programlisting>public void initialize() throws ResourceInitializationException {
+  File directory = new File(
+      (String)getConfigParameterValue(PARAM_INPUTDIR));
+  mEncoding = (String)getConfigParameterValue(PARAM_ENCODING);
+  mDocumentTextXmlTagName = (String)getConfigParameterValue(PARAM_XMLTAG);
+  mLanguage = (String)getConfigParameterValue(PARAM_LANGUAGE);
+  mCurrentIndex = 0;
+
+  // get list of files (not subdirectories) in the specified directory
+  mFiles = new ArrayList();
+  File[] files = directory.listFiles();
+  for (int i = 0; i &lt; files.length; i++) {
+    if (!files[i].isDirectory()) {
+      mFiles.add(files[i]);
+    }
+  }
+}</programlisting>
+ <note><para>This is the zero-argument version of the initialize method. There is
+ also a method on the Collection Reader interface called
+ <literal>initialize(ResourceSpecifier, Map)</literal> but it is not
+ recommended that you override this method in your code. That method performs
+ internal initialization steps and then calls the zero-argument
+ <literal>initialize()</literal>. </para></note>
+
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.hasnext">
+ <title>hasNext()</title>
+
+ <para>The <literal>hasNext()</literal> method returns whether or not there are
+ any documents remaining to be read from the collection. The File System
+ Collection Reader's <literal>hasNext()</literal> method is very
+ simple. It just checks if there are any more files left to be read:
+
+
+ <programlisting>public boolean hasNext() {
+  return mCurrentIndex &lt; mFiles.size();
+}</programlisting>
+ </para>
+
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.required_methods.getnext">
+ <title>getNext(CAS)</title>
+
+ <para>The <literal>getNext()</literal> method reads the next document from the
+ collection and populates a CAS. In the simple case, this amounts to reading the
+ file and calling the CAS's <literal>setDocumentText</literal> method.
+ The example File System Collection Reader is slightly more complex. It first
+ checks for a CAS Initializer. If the CPE includes a CAS Initializer, the CAS
+ Initializer is used to read the document and initialize the CAS.
+ If the CPE does not include a CAS
+ Initializer, the File System Collection Reader reads the document and sets the
+ document text in the CAS.</para>
+
+ <para>The File System Collection Reader also stores additional metadata about
+ the document in the CAS. In particular, it sets the document's language in
+ the special built-in feature structure
+ <literal>uima.tcas.DocumentAnnotation</literal> (see <olink
+ targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.cas.document_annotation"/> for details about this
+ built-in type) and creates an instance of
+ <literal>org.apache.uima.examples.SourceDocumentInformation</literal>,
+ which stores information about the document's source location. This
+ information may be useful to downstream components such as CAS Consumers. Note
+ that the type system descriptor for this type can be found in
+ <literal>org.apache.uima.examples.SourceDocumentInformation.xml</literal>,
+ which is located in the <literal>examples/src</literal> directory.</para>
+
+ <para>The <literal>getNext()</literal> method for the File System Collection Reader looks like
+ this:</para>
+
+
+ <programlisting>public void getNext(CAS aCAS) throws IOException, CollectionException {
+  JCas jcas;
+  try {
+    jcas = aCAS.getJCas();
+  } catch (CASException e) {
+    throw new CollectionException(e);
+  }
+
+  // open input stream to file
+  File file = (File) mFiles.get(mCurrentIndex++);
+  BufferedInputStream fis =
+      new BufferedInputStream(new FileInputStream(file));
+  try {
+    byte[] contents = new byte[(int) file.length()];
+    fis.read(contents);
+    String text;
+    if (mEncoding != null) {
+      text = new String(contents, mEncoding);
+    } else {
+      text = new String(contents);
+    }
+    // put document in CAS
+    jcas.setDocumentText(text);
+  } finally {
+    if (fis != null)
+      fis.close();
+  }
+
+  // set language if it was explicitly specified
+  // as a configuration parameter
+  if (mLanguage != null) {
+    ((DocumentAnnotation) jcas.getDocumentAnnotationFs()).
+        setLanguage(mLanguage);
+  }
+
+  // Also store location of source document in CAS.
+  // This information is critical if CAS Consumers will
+  // need to know where the original document contents
+  // are located.
+  // For example, the Semantic Search CAS Indexer
+  // writes this information into the search index that
+  // it creates, which allows applications that use the
+  // search index to locate the documents that satisfy
+  // their semantic queries.
+  SourceDocumentInformation srcDocInfo =
+      new SourceDocumentInformation(jcas);
+  srcDocInfo.setUri(
+      file.getAbsoluteFile().toURL().toString());
+  srcDocInfo.setOffsetInSource(0);
+  srcDocInfo.setDocumentSize((int) file.length());
+  srcDocInfo.setLastSegment(
+      mCurrentIndex == mFiles.size());
+  srcDocInfo.addToIndexes();
+}</programlisting>
+
+ <para>The Collection Reader can create additional annotations in the CAS at this
+ point, in the same way that annotators create annotations.</para>
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.required_methods.getprogress">
+ <title>getProgress()</title>
+ <para>The Collection Reader is responsible for returning progress information;
+ that is, how much of the collection has been read thus far and how much remains to be
+ read. The framework defines progress very generally; the Collection Reader
+ simply returns an array of <literal>Progress</literal> objects, where each
+ object contains three fields — the amount already completed, the total
+ amount (if known), and a unit (e.g. entities (documents), bytes, or files). The
+ method returns an array so that the Collection Reader can report progress in
+ multiple different units, if that information is available. The File System
+ Collection Reader's <literal>getProgress()</literal> method looks
+ like this:
+
+
+ <programlisting>public Progress[] getProgress() {
+  return new Progress[] {
+      new ProgressImpl(mCurrentIndex, mFiles.size(), Progress.ENTITIES) };
+}</programlisting></para>
+
+ <para>In this particular example, the total number of files in the collection is
+ known, but the total size of the collection is not known. As such, a
+ <literal>ProgressImpl</literal> object for
+ <literal>Progress.ENTITIES</literal> is returned, but a
+ <literal>ProgressImpl</literal> object for
+ <literal>Progress.BYTES</literal> is not.</para>
+
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.required_methods.close">
+ <title>close()</title>
+
+ <para>The close method is called when the Collection Reader is no longer needed.
+ The Collection Reader should then release any resources it may be holding. The
+ FileSystemCollectionReader does not hold resources and so has an empty
+ implementation of this method:</para>
+
+
+ <programlisting>public void close() throws IOException { }</programlisting>
+
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.optional_methods">
+ <title>Optional Methods</title>
+
+ <para>The following methods may be implemented:</para>
+
+ <section id="ugr.tug.cpe.collection_reader.optional_methods.reconfigure">
+ <title>reconfigure()</title>
+ <para>This method is called if the Collection Reader's configuration
+ parameters change.</para>
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.optional_methods.typesysteminit">
+ <title>typeSystemInit()</title>
+
+ <para>If you are only setting the document text in the CAS, or if you are using the
+ JCas (recommended, as in the current example), you do not have to implement this
+ method. If you are directly using the CAS API, this method is used in the same way
+ as it is used for an annotator – see <olink
+ targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.aae.contract_for_annotator_methods"/>
+ for more information.</para>
+ </section>
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.threading">
+ <title>Threading considerations</title>
+
+ <para>Collection readers do not have to be thread safe; they are run with a single
+ thread per instance, and only one instance per instance of the Collection
+ Processing Manager (CPM) is made.</para>
+
+ </section>
+
+ <section id="ugr.tug.cpe.collection_reader.descriptor">
+ <title>XML Descriptor for a Collection Reader</title>
+
+ <para>You can use the Component Description Editor to create and/or edit the File
+ System Collection Reader's descriptor. Here is its descriptor
+ (abbreviated somewhat), which is very similar to an Analysis
+ Engine descriptor:</para>
+
+
+ <programlisting><?db-font-size 80% ?><![CDATA[<collectionReaderDescription
+ xmlns="http://uima.apache.org/resourceSpecifier">
+ <frameworkImplementation>org.apache.uima.java</frameworkImplementation>
+ <implementationName>
+ org.apache.uima.examples.cpe.FileSystemCollectionReader
+ </implementationName>
+ <processingResourceMetaData>
+ <name>File System Collection Reader</name>
+ <description>Reads files from the filesystem.</description>
+ <version>1.0</version>
+ <vendor>The Apache Software Foundation</vendor>
+ <configurationParameters>
+ <configurationParameter>
+ <name>InputDirectory</name>
+ <description>Directory containing input files</description>
+ <type>String</type>
+ <multiValued>false</multiValued>
+ <mandatory>true</mandatory>
+ </configurationParameter>
+ <configurationParameter>
+ <name>Encoding</name>
+ <description>Character encoding for the documents.</description>
+ <type>String</type>
+ <multiValued>false</multiValued>
+ <mandatory>false</mandatory>
+ </configurationParameter>
+ <configurationParameter>
+ <name>Language</name>
+ <description>ISO language code for the documents</description>
+ <type>String</type>
+ <multiValued>false</multiValued>
+ <mandatory>false</mandatory>
+ </configurationParameter>
+ </configurationParameters>
+ <configurationParameterSettings>
+ <nameValuePair>
+ <name>InputDirectory</name>
+ <value>
+ <string>C:/Program Files/apache/uima/examples/data</string>
+ </value>
+ </nameValuePair>
+ </configurationParameterSettings>
+
+ <!-- Type System of CASes returned by this Collection Reader -->
+
+ <typeSystemDescription>
+ <imports>
+ <import name="org.apache.uima.examples.SourceDocumentInformation"/>
+ </imports>
+ </typeSystemDescription>
+
+ <capabilities>
+ <capability>
+ <inputs/>
+ <outputs>
+ <type allAnnotatorFeatures="true">
+ org.apache.uima.examples.SourceDocumentInformation
+ </type>
+ </outputs>
+ </capability>
+ </capabilities>
+ <operationalProperties>
+ <modifiesCas>true</modifiesCas>
+ <multipleDeploymentAllowed>false</multipleDeploymentAllowed>
+ <outputsNewCASes>true</outputsNewCASes>
+ </operationalProperties>
+ </processingResourceMetaData>
+</collectionReaderDescription>]]></programlisting>
+
+ </section>
+ </section>
+ </section>
+
+ <section id="ugr.tug.cpe.cas_initializer.developing"><title>Developing CAS
+ Initializers</title> <note><para>CAS Initializers are deprecated as of
+ version 2.1. For complex initialization, create additional Subjects of
+ Analysis instead (see <olink
+ targetdoc="&uima_docs_tutorial_guides;" targetptr="ugr.tug.mvs"/>).</para></note>
+
+ <para>In UIMA 1.x, the CAS Initializer component was intended to be used as a plug-in
+ to the Collection Reader for when the task of populating the CAS from a raw document is
+ complex and might be reusable with other data collections.</para>
+
+ <para>A CAS Initializer Java class must implement the interface
+ <literal>org.apache.uima.collection.CasInitializer</literal>, and will also
+ generally extend from the convenience base class
+ <literal>org.apache.uima.collection.CasInitializer_ImplBase</literal>. A
+ CAS Initializer also must have an XML descriptor, which has the exact same form as a
+ Collection Reader Descriptor except that the outer tag is
+ <literal><casInitializerDescription></literal>.</para>
+
+ <para>CAS Initializers have optional <literal>initialize()</literal>,
+ <literal>reconfigure()</literal>, and <literal>typeSystemInit()</literal>
+ methods, which perform the same functions as they do for Collection Readers. The only
+ required method for a CAS Initializer is <literal>initializeCas(Object,
+ CAS)</literal>. This method takes the raw document (for example, an
+ <literal>InputStream</literal> object from which the document can be read) and a
+ CAS, and populates the CAS from the document.</para>
+ </section>
+
+ <section id="ugr.tug.cpe.cas_consumer.developing"><title>Developing CAS
+ Consumers</title>
+
+ <note><para>In version 2, there is no difference in capability
+ between CAS Consumers and ordinary Analysis Engines, except for the default setting of
+ the XML parameters for <literal>multipleDeploymentAllowed</literal> and
+ <literal>modifiesCas</literal>. We recommend for future work that users implement
+ and use Analysis Engine components instead of CAS Consumers.</para>
+ <para>The rest of this section is written using the version 1 style of CAS Consumer;
+ the methods described are also available for Analysis Engines. Note that the
+ CAS Consumer <literal>processCas</literal> method is equivalent to the Analysis Engine
+ <literal>process</literal> method.</para></note>
+
+ <para>A CAS Consumer receives each CAS after it has been analyzed by the Analysis
+ Engine. CAS Consumers typically do not update the CAS; they typically extract data
+ from the CAS and persist selected information to aggregate data structures such as
+ search engine indexes or databases.</para>
+
+ <para>A CAS Consumer Java class must implement the interface
+ <literal>org.apache.uima.collection.CasConsumer</literal>, and will also
+ generally extend from the convenience base class
+ <literal>org.apache.uima.collection.CasConsumer_ImplBase</literal>. A CAS
+ Consumer also must have an XML descriptor, which has the exact same form as a
+ Collection Reader Descriptor except that the outer tag is
+ <literal><casConsumerDescription></literal>.</para>
+
+ <para>CAS Consumers have optional <literal>initialize()</literal>,
+ <literal>reconfigure()</literal>, and <literal>typeSystemInit()</literal>
+ methods, which perform the same functions as they do for Collection Readers and CAS
+ Initializers. The only required method for a CAS Consumer is
+ <literal>processCas(CAS)</literal>, which is where the CAS Consumer does the bulk
+ of its work (i.e., consume the CAS).</para>
+
+ <para>The <literal>CasConsumer</literal> interface (as well as the version 2
+ Analysis Engine interface) additionally defines batch
+ and collection level processing methods. The CAS Consumer or Analysis Engine
+ can implement the
+ <literal>batchProcessComplete()</literal> method to perform processing that
+ should occur at the end of each batch of CASes. Similarly, the CAS Consumer
+ or Analysis Engine can
+ implement the <literal>collectionProcessComplete()</literal> method to
+ perform any collection level processing at the end of the collection.</para>
+
+ <para>A very simple example of a CAS Consumer, which writes an XML representation of the
+ CAS to a file, is the XMI Writer CAS Consumer. The Java code is in the class
+ <literal>org.apache.uima.examples.cpe.XmiWriterCasConsumer</literal> and
+ the descriptor is in
+ <literal>%UIMA_HOME%/examples/descriptors/cas_consumer/XmiWriterCasConsumer.xml</literal>.</para>
+
+ <section id="ugr.tug.cpe.cas_consumer.required_methods">
+ <title>Required Methods for a CAS Consumer</title>
+
+ <para>When extending the convenience class
+ <literal>org.apache.uima.collection.CasConsumer_ImplBase</literal>, the
+ following abstract methods must be implemented:</para>
+
+ <section id="ugr.tug.cpe.cas_consumer.required_methods.initialize">
+ <title>initialize()</title>
+ <para>The <literal>initialize()</literal> method is called by the framework
+ when the CAS Consumer is first created.
+ <literal>CasConsumer_ImplBase</literal> actually provides a default
+ implementation of this method (i.e., it is not abstract), so you are not strictly
+ required to implement this method. However, a typical CAS Consumer will
+ implement this method to obtain parameter values and perform various
+ initialization steps.</para>
+
+ <para>In this method, the CAS Consumer can access the values of its configuration
+ parameters and perform other initialization logic. The example XMI Writer CAS
+ Consumer reads its configuration parameters and sets up the output directory:
+
+
+ <programlisting><?db-font-size 80% ?>public void initialize() throws ResourceInitializationException {
+ mDocNum = 0;
+ mOutputDir = new File((String) getConfigParameterValue(PARAM_OUTPUTDIR));
+ if (!mOutputDir.exists()) {
+ mOutputDir.mkdirs();
+ }
+}</programlisting></para>
+ </section>
+
+ <section id="ugr.tug.cpe.cas_consumer.required_methods.processcas">
+ <title>processCas()</title>
+
+ <para>The <literal>processCas()</literal> method is where the CAS Consumer
+ does most of its work. In our example, the XMI Writer CAS Consumer obtains an
+ iterator over the document metadata in the CAS (in the
+ SourceDocumentInformation feature structure, which is created by the File
+ System Collection Reader) and extracts the URI for the current document. From
+ this the output filename is constructed in the output directory and a subroutine
+ (<literal>writeXmi</literal>) is called to generate the output file. The
+ <literal>writeXmi</literal> subroutine uses the
+ <literal>XmiCasSerializer</literal> class provided with the UIMA SDK to
+ serialize the CAS to the output file (see the example source code for
+ details).</para>
+
+
+ <programlisting>public void processCas(CAS aCAS) throws ResourceProcessException {
+ String modelFileName = null;
+
+ JCas jcas;
+ try {
+ jcas = aCAS.getJCas();
+ } catch (CASException e) {
+ throw new ResourceProcessException(e);
+ }
+
+ // retrieve the filename of the input file from the CAS
+ FSIterator it = jcas
+ .getAnnotationIndex(SourceDocumentInformation.type)
+ .iterator();
+ File outFile = null;
+ if (it.hasNext()) {
+ SourceDocumentInformation fileLoc =
+ (SourceDocumentInformation) it.next();
+ File inFile;
+ try {
+ inFile = new File(new URL(fileLoc.getUri()).getPath());
+ String outFileName = inFile.getName();
+ if (fileLoc.getOffsetInSource() > 0) {
+ outFileName += ("_" + fileLoc.getOffsetInSource());
+ }
+ outFileName += ".xmi";
+ outFile = new File(mOutputDir, outFileName);
+ modelFileName = mOutputDir.getAbsolutePath() +
+ "/" + inFile.getName() + ".ecore";
+ } catch (MalformedURLException e1) {
+ // invalid URL, use default processing below
+ }
+ }
+ if (outFile == null) {
+ outFile = new File(mOutputDir, "doc" + mDocNum++);
+ }
+ // serialize the CAS as XMI and write to output file
+ try {
+ writeXmi(jcas.getCas(), outFile, modelFileName);
+ } catch (IOException e) {
+ throw new ResourceProcessException(e);
+ } catch (SAXException e) {
+ throw new ResourceProcessException(e);
+ }
+}</programlisting>
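+
+ <para>The file-naming rule used above can be isolated into a small helper, sketched
+ here with only JDK classes. <literal>XmiOutputNames</literal> and
+ <literal>outputFileName</literal> are hypothetical names, not part of the example
+ source. The sketch also swaps in <literal>java.net.URI</literal> for the
+ <literal>java.net.URL</literal> used in the listing, so an invalid URI surfaces as an
+ <literal>IllegalArgumentException</literal> and the fallback to a numbered default
+ name is not shown.</para>
+
+ <programlisting><![CDATA[import java.io.File;
+import java.net.URI;
+
+final class XmiOutputNames {
+    /**
+     * Derives the output file name from the source document URI:
+     * the input file's simple name, an optional "_offset" suffix for
+     * documents that start part-way into the source, then ".xmi".
+     */
+    static String outputFileName(String sourceUri, int offsetInSource) {
+        String name = new File(URI.create(sourceUri).getPath()).getName();
+        if (offsetInSource > 0) {
+            name += "_" + offsetInSource;
+        }
+        return name + ".xmi";
+    }
+}]]></programlisting>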
+
+ </section>
+
+ <section id="ugr.tug.cpe.cas_consumer.optional_methods">
+ <title>Optional Methods</title>
+ <para>The following methods are optional in a CAS Consumer, though they are often
+ used.</para>
+ <section id="ugr.tug.cpe.cas_consumer.optional_methods.batchprocesscomplete">
+ <title>batchProcessComplete()</title>
+
+ <para>The framework calls the <literal>batchProcessComplete()</literal> method at the end of each
+ batch of CASes. This gives the CAS Consumer or Analysis Engine
+ an opportunity to perform any batch
+ level processing. Our simple XMI Writer CAS Consumer does not perform any
+ batch level processing, so this method is empty. Batch size is set in the
+ Collection Processing Engine descriptor.</para>
+ </section>
+
+ <section id="ugr.tug.cpe.cas_consumer.optional_methods.collectionprocesscomplete">
+ <title>collectionProcessComplete()</title>
+
+ <para>The framework calls the <literal>collectionProcessComplete()</literal> method at the end
+ of the collection (i.e., when all objects in the collection have been
+ processed). At this point in time, no CAS is passed in as a parameter. This gives
+ the CAS Consumer or Analysis Engine an opportunity to perform collection processing over the
+ entire set of objects in the collection. Our simple XMI Writer CAS Consumer
+ does not perform any collection level processing, so this method is
+ empty.</para>
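+
+ <para>The order in which these callbacks arrive can be pictured with a small
+ simulation. Everything in it is hypothetical: <literal>ConsumerCallbacks</literal>
+ is a stand-in interface rather than the UIMA <literal>CasConsumer</literal>
+ interface, plain strings take the place of CASes, and the driver loop is a
+ simplification of what the framework does. In particular, this driver signals only
+ full batches; how a final partial batch is treated is not covered here.</para>
+
+ <programlisting><![CDATA[import java.util.ArrayList;
+import java.util.List;
+
+/** Hypothetical stand-in for the consumer callbacks; not a UIMA interface. */
+interface ConsumerCallbacks {
+    void processCas(String cas);
+    void batchProcessComplete();
+    void collectionProcessComplete();
+}
+
+/** Records each callback so the call order can be inspected. */
+final class RecordingConsumer implements ConsumerCallbacks {
+    final List<String> events = new ArrayList<>();
+
+    public void processCas(String cas) { events.add("cas:" + cas); }
+    public void batchProcessComplete() { events.add("batch"); }
+    public void collectionProcessComplete() { events.add("collection"); }
+}
+
+final class CallbackOrderDemo {
+    /** Simplified driver: one call per CAS, one per full batch, one at the end. */
+    static void drive(List<String> collection, int batchSize,
+                      ConsumerCallbacks consumer) {
+        int inBatch = 0;
+        for (String cas : collection) {
+            consumer.processCas(cas);
+            if (++inBatch == batchSize) {
+                consumer.batchProcessComplete();
+                inBatch = 0;
+            }
+        }
+        consumer.collectionProcessComplete();
+    }
+}]]></programlisting>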
+ </section>
+
+ </section>
+
+ </section>
+ </section>
+ </section>
+
+ <section id="ugr.tug.cpe.deploying_a_cpe">
+ <title>Deploying a CPE</title>
+
+ <para>The CPM provides a number of service and deployment options that cover
+ instantiation and execution of CPEs, error recovery, and local and distributed
+ deployment of the CPE components. The behavior of the CPM (and correspondingly, the
+ CPE) is controlled by various options and parameters set in the CPE descriptor. The
+ current version of the CPE Configurator tool, however, supports only default error
+ handling and deployment options. To change these options, you must manually edit the
+ CPE descriptor.</para>
+
+ <para>Eventually the CPE Configurator tool will support configuring these options and a
+ detailed tutorial for these settings will be provided. In the meantime, we provide only
+ a high-level, conceptual overview of these advanced features in the rest of this
+ chapter, and refer the advanced user to <olink targetdoc="&uima_docs_ref;"
+ targetptr="ugr.ref.xml.cpe_descriptor"/> for details on setting these options in the CPE
+ Descriptor.</para>
+
+ <para> <xref linkend="ugr.tug.cpe.fig.cpe_instantiation"/> shows a logical view of
+ how an application uses the UIMA framework to instantiate a CPE from a CPE descriptor.
+ The CPE descriptor identifies the CPE components (referencing their corresponding
+ descriptors) and specifies the various options for configuring the CPM and deploying
+ the CPE components.</para>
+
+ <figure id="ugr.tug.cpe.fig.cpe_instantiation">
+ <title>CPE Instantiation</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="5.7in" format="PNG"
+ fileref="&imgroot;image018.png"/>
+ </imageobject>
+ <textobject><phrase>Picture of deployment of a CPE</phrase></textobject>
+ </mediaobject>
+ </figure>
+
+ <para id="ugr.tug.cpe.deployment_alternatives">There are three deployment modes
+ for CAS Processors (Analysis Engines and CAS Consumers) in a CPE:</para>
+
+ <orderedlist><listitem><para><emphasis role="bold">Integrated</emphasis> (runs
+ in the same Java instance as the CPM)</para></listitem>
+
+ <listitem><para><emphasis role="bold">Managed</emphasis> (runs in a separate
+ process on the same machine), and</para></listitem>
+
+ <listitem><para><emphasis role="bold">Non-managed</emphasis> (runs in a
+ separate process, perhaps on a different machine). </para></listitem>
+ </orderedlist>
+
+ <para>An integrated CAS Processor runs in the same JVM as the CPE. A managed CAS Processor
+ runs in a separate process from the CPE, but still on the same computer. The CPE controls
+ startup, shutdown, and recovery of a managed CAS Processor. A non-managed CAS
+ Processor runs as a service and may be on the same computer as the CPE or on a remote
+ computer. A non-managed CAS Processor <emphasis role="bold-italic">
+ service</emphasis> is started and managed independently from the CPE.</para>
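+
+ <para>The differences between the three modes can be summarized as a small lookup
+ table. The enum below is a hypothetical sketch and not part of the UIMA API; the
+ strings are the values of the <literal>deployment</literal> attribute used in the
+ example CPE descriptor fragments in this chapter.</para>
+
+ <programlisting><![CDATA[/** Hypothetical summary of the three CAS Processor deployment modes. */
+enum DeploymentMode {
+    INTEGRATED("integrated", false, false),
+    MANAGED("local", true, true),
+    NON_MANAGED("remote", true, false);
+
+    final String descriptorValue;   // value of the deployment attribute
+    final boolean separateProcess;  // runs outside the JVM of the CPE
+    final boolean launchedByCpe;    // CPE starts, restarts, and stops the process
+
+    DeploymentMode(String descriptorValue, boolean separateProcess,
+                   boolean launchedByCpe) {
+        this.descriptorValue = descriptorValue;
+        this.separateProcess = separateProcess;
+        this.launchedByCpe = launchedByCpe;
+    }
+
+    /** Maps a deployment attribute value back to its mode. */
+    static DeploymentMode fromDescriptorValue(String value) {
+        for (DeploymentMode mode : values()) {
+            if (mode.descriptorValue.equals(value)) {
+                return mode;
+            }
+        }
+        throw new IllegalArgumentException("unknown deployment mode: " + value);
+    }
+}]]></programlisting>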
+
+ <para>For both managed and non-managed CAS Processors, the CAS must be transmitted
+ between separate processes and possibly between separate computers. This is
+ accomplished using <emphasis>Vinci</emphasis>, a communication protocol used by
+ the CPM and which is provided as a part of Apache UIMA. Vinci handles service naming and
+ location and data transport (see <olink targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.application.how_to_deploy_a_vinci_service"/> for more
+ information). Service naming and location are provided by a <emphasis>Vinci Naming
+ Service</emphasis>, or <emphasis>VNS</emphasis>. For managed CAS Processors, the
+ CPE uses its own internal VNS. For non-managed CAS Processors, a separate VNS must be
+ running.</para> <note><para>The UIMA SDK also supports using unmanaged remote
+ services via the web-standard SOAP communications protocol (see <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.application.how_to_deploy_as_soap"/>). This approach is
+ based on a proxy implementation, where the proxy is essentially running in an integrated
+ mode. To use this approach with the CPM, use the Integrated mode, with the component being
+ an Aggregate which, in turn, connects to a remote service. </para></note>
+
+ <para>The CPE Configurator tool currently only supports constructing CPEs that deploy
+ CAS Processors in integrated mode. To deploy CAS Processors in any other mode, the CPE
+ descriptor must be edited by hand (better tooling may be provided later). Details on the
+ CPE descriptor and the required settings for various CAS Processor deployment modes
+ can be found in <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/>
+ . In the following sections we merely summarize the various CAS Processor deployment
+ options.</para>
+
+ <section id="ugr.tug.cpe.managed_deployment">
+ <title>Deploying Managed CAS Processors</title>
+
+ <para>Managed CAS Processor deployment is shown in <xref
+ linkend="ugr.tug.cpe.fig.managed_deployment"/>. A managed CAS Processor is
+ deployed by the CPE as a Vinci service. The CPE manages the lifecycle of the CAS
+ Processor including service launch, restart on failures, and service shutdown. A
+ managed CAS Processor runs on the same machine as the CPE, but in a separate process.
+ This provides the necessary fault isolation for the CPE to protect it from non-robust
+ CAS Processors. A fatal failure of a managed CAS Processor does not threaten the
+ stability of the CPE.</para>
+
+ <figure id="ugr.tug.cpe.fig.managed_deployment">
+ <title>CPE with Managed CAS Processors</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="3.6in" format="PNG"
+ fileref="&imgroot;image020.png"/>
+ </imageobject>
+ <textobject><phrase>Managed deployment showing separate JVMs and CASes
+ flowing between them</phrase></textobject>
+ </mediaobject>
+ </figure>
+
+ <para>The CPE communicates with managed CAS Processors using the Vinci communication
+ protocol. A CAS Processor is launched as a Vinci service and its
+ <literal>process()</literal> method is invoked remotely via a Vinci command. The
+ CPE uses its own internal VNS to support managed CAS processors. The VNS, by default,
+ listens on port 9005. If this port is not available, the VNS will increment its listen
+ port until it finds one that is available. All managed CAS Processors are internally
+ configured to <quote>talk</quote> to the CPE managed VNS. This internal VNS is
+ transparent to the end user launching the CPE.</para>
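+
+ <para>The port search described above can be pictured with plain
+ <literal>java.net</literal> sockets. This is not the VNS implementation:
+ <literal>PortProbe</literal>, <literal>bindFirstFree</literal>, and the
+ <literal>maxAttempts</literal> limit are hypothetical names introduced for
+ illustration, and only the starting port 9005 comes from the text.</para>
+
+ <programlisting><![CDATA[import java.io.IOException;
+import java.net.ServerSocket;
+
+final class PortProbe {
+    /**
+     * Tries startPort, startPort + 1, ... and returns a socket bound to
+     * the first port that can be opened. The caller must close the socket.
+     */
+    static ServerSocket bindFirstFree(int startPort, int maxAttempts)
+            throws IOException {
+        IOException last = null;
+        for (int i = 0; i < maxAttempts; i++) {
+            int port = startPort + i;
+            if (port > 65535) {
+                break;  // ran out of valid port numbers
+            }
+            try {
+                return new ServerSocket(port);
+            } catch (IOException e) {
+                last = e;  // port unavailable; try the next one
+            }
+        }
+        throw new IOException("no free port found starting at " + startPort, last);
+    }
+}]]></programlisting>
+
+ <para>A caller following the default described above might write
+ <literal>PortProbe.bindFirstFree(9005, 100)</literal>, where the limit of 100
+ attempts is an arbitrary choice.</para>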
+
+ <para>To deploy a managed CAS Processor, the CPE deployer must change the CPE
+ descriptor. The following is a section from the CPE descriptor that shows an example
+ configuration specifying a managed CAS Processor.</para>
+
+
+ <programlisting><casProcessor <emphasis role="bold-italic">deployment="local"</emphasis> name="Meeting Detector TAE">
+ <descriptor>
+ <include href="deploy/vinci/Deploy_MeetingDetectorTAE.xml"/>
+ </descriptor>
+ <runInSeparateProcess>
+ <exec dir="." executable="java">
+ <env key="CLASSPATH"
+ value="src;
+ C:/Program Files/apache/uima/lib/uima-core.jar;
+ C:/Program Files/apache/uima/lib/uima-cpe.jar;
+ C:/Program Files/apache/uima/lib/uima-examples.jar;
+ C:/Program Files/apache/uima/lib/uima-adapter-vinci.jar;
+ C:/Program Files/apache/uima/lib/jVinci.jar"/>
+ <arg>-DLOG=C:/Temp/service.log</arg>
+ <arg>org.apache.uima.reference_impl.collection.
+ service.vinci.VinciAnalysisEnginerService_impl</arg>
+ <arg>${descriptor}</arg>
+ </exec>
+ </runInSeparateProcess>
+ <deploymentParameters/>
+ <filter/>
+ <errorHandling>
+ <errorRateThreshold action="terminate" value="1/100"/>
+ <maxConsecutiveRestarts action="terminate" value="3"/>
+ <timeout max="100000"/>
+ </errorHandling>
+ <checkpoint batch="10000"/>
+</casProcessor></programlisting>
+
+ <para>See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for
+ details and required settings.</para>
+
+ </section>
+
+ <section id="ugr.tug.cpe.deploying_nonmanaged_cas_processors">
+ <title>Deploying Non-managed CAS Processors</title>
+
+ <para>Non-managed CAS Processor deployment is shown in <xref
+ linkend="ugr.tug.cpe.fig.nonmanaged_cpe"/>. In non-managed mode, the CPE
+ supports connectivity to CAS Processors running on local or remote computers using
+ Vinci. Non-managed processors differ from managed processors in two
+ respects:
+
+ <orderedlist><listitem><para>Non-managed processors are neither started nor
+ stopped by the CPE.</para></listitem>
+
+ <listitem><para>Non-managed processors use an independent VNS, also neither
+ started nor stopped by the CPE. </para></listitem></orderedlist></para>
+
+ <figure id="ugr.tug.cpe.fig.nonmanaged_cpe">
+ <title>CPE with non-managed CAS Processors</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="4.8in" format="PNG"
+ fileref="&imgroot;image023.png"/>
+ </imageobject>
+ <textobject><phrase>Non-managed CPE deployment</phrase></textobject>
+ </mediaobject>
+ </figure>
+
+ <para>While non-managed CAS Processors provide the same level of fault isolation and
+ robustness as managed CAS Processors, error recovery support for non-managed CAS
+ Processors is much more limited. In particular, the CPE cannot restart a non-managed
+ CAS Processor after an error.</para>
+
+ <para>Non-managed CAS Processors also require a separate Vinci Naming Service
+ running on the network. This VNS must be manually started and monitored by the end user
+ or application. Instructions for running a VNS can be found in <olink
+ targetdoc="&uima_docs_tutorial_guides;"
+ targetptr="ugr.tug.application.vns.starting"/>.</para>
+
+ <para>To deploy a non-managed CAS Processor, the CPE deployer must change the CPE
+ descriptor. The following is a section from the CPE descriptor that shows an example
+ configuration for the non-managed CAS Processor.</para>
+
+
+ <programlisting><casProcessor <emphasis role="bold-italic">deployment="remote"</emphasis> name="Meeting Detector TAE">
+ <descriptor>
+ <include href=
+ "descriptors/vinciService/MeetingDetectorVinciService.xml"/>
+ </descriptor>
+ <deploymentParameters/>
+ <filter/>
+ <errorHandling>
+ <errorRateThreshold action="terminate" value="1/100"/>
+ <maxConsecutiveRestarts action="terminate" value="3"/>
+ <timeout max="100000"/>
+ </errorHandling>
+ <checkpoint batch="10000"/>
+</casProcessor></programlisting>
+
+ <para>See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for
+ details and required settings.</para>
+
+ </section>
+
+ <section id="ugr.tug.cpe.integrated_deployment">
+ <title>Deploying Integrated CAS Processors</title>
+
+ <para>Integrated CAS Processors are shown in <xref
+ linkend="ugr.tug.cpe.fig.integrated_deployment"/>. Here the CAS Processors
+ run in the same JVM as the CPE, just like the Collection Reader and CAS Initializer.
+ This deployment method results in minimal CAS communication and transport overhead
+ as the CAS is shared in the same process space of the JVM. However, a CPE running with all
+ integrated CAS Processors is limited in scalability by the capability of the single
+ computer on which the CPE is running. There is also a stability risk associated with
+ integrated processors because a poorly written CAS Processor can cause the JVM, and
+ hence the entire CPE, to abort.</para>
+
+ <figure id="ugr.tug.cpe.fig.integrated_deployment">
+ <title>CPE with integrated CAS Processor</title>
+ <mediaobject>
+ <imageobject>
+ <imagedata width="3.2in" format="PNG"
+ fileref="&imgroot;image026.png"/>
+ </imageobject>
+ <textobject><phrase>CPE with integrated CAS Processor</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+ <para>The following is a section from a CPE descriptor that shows an example
+ configuration for the integrated CAS Processor.</para>
+
+
+ <programlisting><casProcessor <emphasis role="bold-italic">deployment="integrated"</emphasis> name="Meeting Detector TAE">
+ <descriptor>
+ <include href="descriptors/tutorial/ex4/MeetingDetectorTAE.xml"/>
+ </descriptor>
+ <deploymentParameters/>
+ <filter/>
+ <errorHandling>
+ <errorRateThreshold action="terminate" value="100/1000"/>
+ <maxConsecutiveRestarts action="terminate" value="30"/>
+ <timeout max="100000"/>
+ </errorHandling>
+ <checkpoint batch="10000"/>
+</casProcessor></programlisting>
+
+ <para>See <olink targetdoc="&uima_docs_ref;" targetptr="ugr.ref.xml.cpe_descriptor"/> for
+ details and required settings.</para>
+
+ </section>
+ </section>
+
+ <section id="ugr.tug.cpe.collection_processing_examples">
+ <title>Collection Processing Examples</title>
+
+ <para>The UIMA SDK includes a set of examples illustrating the three modes of deployment:
+ integrated, managed, and non-managed. These are in the
+ <literal>/examples/descriptors/collection_processing_engine</literal>
+ directory. There are three CPE descriptors that run an example annotator (the Meeting
+ Finder) in these modes.</para>
+
+ <para>To run either the integrated or managed examples, use the
+ <literal>runCPE</literal> script in the /bin directory of the UIMA installation,
+ passing the appropriate CPE descriptor as an argument, or
+ if you're using Eclipse and have the <literal>uimaj-examples</literal> project in your
+ workspace, you can use the Eclipse Menu → Run → Run... → and then pick the
+ launch configuration <quote>UIMA Run CPE</quote>.</para>
+
+ <note><para>The <literal>runCPE</literal> script <emphasis role="bold-italic"> must</emphasis>
+ be run from the <literal>%UIMA_HOME%\examples</literal> directory, because the example
+ CPE descriptors use relative path names that are resolved relative to this working directory.
+ For instance,
+
+ <literallayout>runCPE
+descriptors\collection_processing_engine\MeetingFinderCPE_Integrated.xml</literallayout></para>
+ </note>
+
+ <!--
+ <para>If you installed the examples into Eclipse, you can run directly from Eclipse by
+ creating a run configuration. To do this, highlight the SimpleRunCPE.java source file
+ in the examples src/org/apache/uima/examples/cpe directory, and then</para>
+
+ <orderedlist><listitem><para>pick the menu Run → Run...</para></listitem>
+
+ <listitem><para>click <quote>Java Application</quote> and press
+ <quote>New</quote></para></listitem>
+
+ <listitem><para>click on the Arguments panel, and insert a path to the appropriate CPE
+ descriptor in the <quote>Program Arguments</quote> box by typing, for instance:
+ <literal>descriptors/collection_processing_engine/
+ MeetingFinderCPE_Integrated.xml</literal>
+ </para></listitem>
+
+ <listitem><para>Then press <quote>Run</quote> </para></listitem>
+ </orderedlist>
+ -->
+
+ <para>To run the non-managed example, there are some additional steps.
+
+ <orderedlist><listitem><para>Start a VNS service by running the
+ <literal>startVNS</literal> script in the <literal>/bin</literal>
+ directory, or using the Eclipse launcher <quote>UIMA Start VNS</quote>.</para></listitem>
+
+ <listitem><para>Deploy the Meeting Detector Analysis Engine as a Vinci service by
+ running the <literal>startVinciService</literal> script in the
+ <literal>/bin</literal> directory and passing it the location of the
+ descriptor to deploy, in this case
+ <literal>%UIMA_HOME%/examples/deploy/vinci/Deploy_MeetingDetectorTAE.xml</literal>.
+ Alternatively, if you're using Eclipse and have the <literal>uimaj-examples</literal> project in your
+ workspace, you can use the Eclipse Menu → Run → Run... → and then pick the
+ launch configuration <quote>UIMA Start Vinci Service</quote>.
+ </para></listitem>
+
+ <listitem><para>Now, run the <literal>runCPE</literal> script (or if in Eclipse, run the
+ launch configuration <quote>UIMA Run CPE</quote>), passing it the CPE descriptor for the non-managed
+ version
+ (<literal>%UIMA_HOME%/examples/descriptors/collection_processing_engine/MeetingFinderCPE_NonManaged.xml</literal>).
+ </para></listitem></orderedlist></para>
+
+ <para>This assumes that the Vinci Naming Service, the runCPE application, and the
+ <literal>MeetingDetectorTAE</literal> service are all running on the same machine.
+ Most of the scripts that need information about VNS will look for values to use in
+ environment variables VNS_HOST and VNS_PORT; these default to
+ <quote>localhost</quote> and <quote>9000</quote>. You may set these to appropriate
+ values before running the scripts, as needed; you can also pass the name of the VNS host as
+ the second argument to the startVinciService script.</para>
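+
+ <para>The defaulting rule for these two variables can be sketched in a few lines of
+ Java. <literal>VnsSettings</literal> and <literal>fromEnvironment</literal> are
+ hypothetical names; the shipped scripts do this in shell or batch syntax, and this
+ sketch only mirrors the defaults stated above.</para>
+
+ <programlisting><![CDATA[import java.util.Map;
+
+final class VnsSettings {
+    final String host;
+    final int port;
+
+    VnsSettings(String host, int port) {
+        this.host = host;
+        this.port = port;
+    }
+
+    /** Reads VNS_HOST and VNS_PORT, falling back to localhost and 9000. */
+    static VnsSettings fromEnvironment(Map<String, String> env) {
+        String host = env.getOrDefault("VNS_HOST", "localhost");
+        int port = Integer.parseInt(env.getOrDefault("VNS_PORT", "9000"));
+        return new VnsSettings(host, port);
+    }
+}]]></programlisting>
+
+ <para>An application would typically call
+ <literal>VnsSettings.fromEnvironment(System.getenv())</literal>; taking the map as a
+ parameter keeps the rule easy to exercise with other values.</para>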
+
+ <para>Alternatively, you can edit the scripts and/or the XML files to specify
+ alternatives for the VNS_HOST and VNS_PORT. For instance, if the
+ <literal>runCPE</literal> application is running on a different machine from the
+ Vinci Naming Service, you can edit the
+ <literal>MeetingFinderCPE_NonManaged.xml</literal> and change the vnsHost
+ parameter:
+ <literal><parameter name="vnsHost" value="localhost" type="string"/></literal>
+ to specify the VNS host instead of <quote>localhost</quote>.</para>
+ </section>
+
+</chapter>
+