Posted to commits@uima.apache.org by pk...@apache.org on 2012/11/05 17:56:05 UTC
svn commit: r1405876 -
/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml
Author: pkluegl
Date: Mon Nov 5 16:56:04 2012
New Revision: 1405876
URL: http://svn.apache.org/viewvc?rev=1405876&view=rev
Log:
UIMA-2285
- added some more information about configuration parameters
- fixed formatting
Modified:
uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml
Modified: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml?rev=1405876&r1=1405875&r2=1405876&view=diff
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml (original)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml Mon Nov 5 16:56:04 2012
@@ -5,814 +5,603 @@
<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
%uimaents;
]>
-<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
- license agreements. See the NOTICE file distributed with this work for additional
- information regarding copyright ownership. The ASF licenses this file to
- you under the Apache License, Version 2.0 (the "License"); you may not use
- this file except in compliance with the License. You may obtain a copy of
- the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
- by applicable law or agreed to in writing, software distributed under the
- License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
- OF ANY KIND, either express or implied. See the License for the specific
- language governing permissions and limitations under the License. -->
-
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
<chapter id="ugr.tools.tm.introduction">
- <title>TextMarker</title>
- <para>The TextMarker system is an open source tool
- for the development
- of rule-based information extraction applications.
- The development
- environment is based on the DLTK framework. It
- supports the knowledge
- engineer with a full-featured rule editor,
- components for the
- explanation of the rule inference and a build
- process for generic UIMA
- Analysis Engines and Type Systems.
- Therefore TextMarker components can
- be easily created and combined
- with other UIMA components in different
- information extraction
- pipelines rather flexibly.
-
- TextMarker applies a
- specialized rule representation language for the effective
- knowledge
- formalization:
- The rules of the TextMarker language are composed of a
- list of rule
- elements that themselves consists of four parts: The
- mandatory
- matching condition establishes a connection to the input
- document by
- referring to an already existing concept, respectively
- annotation.
- The
- optional quantifier defines the usage of the matching
- condition
- similar to regular expressions. Then, additional conditions
- add
- constraints to the matched text fragment and additional actions
- determine the consequences of the rule. Therefore, TextMarker rules
- match on a pattern of given annotations and, if the additional
- conditions evaluate true, then they execute their actions, e.g.
- create
- a new annotation. If no initial annotations exist, for example,
- created by another component, a scanner is used to seed simple token
- annotations contained in a taxonomy.
-
- The TextMarker system provides
- unique functionality that is usually not
- found in similar systems. The
- actions are able to modify the document
- either by replacing or
- deleting
- text fragments or by filtering the
- view on the document. In
- this case,
- the rules ignore some
- annotations,
- e.g. HTML markup, or are
- executed only
- on the remaining text passages.
- The knowledge engineer
- is able to add
- heuristic knowledge by using
- scoring rules.
- Additionally, several
- language elements common to
- scripting languages
- like conditioned
- statements, loops, procedures,
- recursion, variables
- and expressions
- increase the expressiveness of
- the language. Rules are
- able to directly
- invoke external rule sets or
- arbitrary UIMA Analysis
- Engines and foreign
- libraries can be
- integrated with the extension
- mechanism for new
- language elements.
-
- </para>
- <section id="ugr.tools.tm.introduction.metaphor">
- <title>Introduction</title>
- <para>
- In manual information extraction humans often apply a strategy
- according to a highlighter metaphor: First relevant headlines are
- considered and classified according to their content by coloring
- them
- with different highlighters. The paragraphs of the annotated
- headlines
- are then considered further. Relevant text fragments or
- single words
- in the context of that headline can then be colored. In
- this way, a
- top-down analysis and extraction strategy is implemented.
- Necessary
- additional information can then be added that either refers
- to other
- text segments or contains valuable domain specific
- information.
- Finally the colored text can be easily analyzed
- concerning the
- relevant information.
-
- The TextMarker system (textmarker
- is a common german word for a
- highlighter) tries to imitate this
- manual extraction method by
- formalizing the appropriate actions using
- matching rules: The rules
- mark sequences of words, extract text
- segments or modify the input
- document depending on textual
- features.The default input for the
- TextMarker system is
- semi-structured text, but it can also process
- structured or free
- text.
- Technically, HTML is often the input
- format,
- since most word
- processing
- documents can be converted to HTML.
- Additionally, the
- TextMarker
- systems offers the possibility to
- create
- a modified output
- document.
- </para>
- </section>
- <section id="ugr.tools.tm.introduction.concepts">
- <title>Core Concepts</title>
- <para>
- As a first step in the extraction process the TextMarker system uses
- a
- tokenizer (scanner) to tokenize the input document and to create a
- stream of basic symbols. The types and valid annotations of the
- possible tokens are predefined by a taxonomy of annotation types.
- Annotations simply refer to a section of the input document and
- assign a type or concept to the respective text fragment. The figure
- on the right shows an excerpt of a basic annotation taxonomy: CW
- describes all tokens, for example, that contains a single word
- starting with a capital letter, MARKUP corresponds to HTML or XML
- tags, and PM refers to all kinds of punctuations marks. Take a look
- at [basic annotations|BasicAnnotationList] for a complete list of
- initial annotations.
-
-
- <screenshot>
- <mediaobject>
- <imageobject>
- <imagedata scale="80" format="PNG" fileref="&imgroot;symboltaxo.png" />
- </imageobject>
- <textobject>
- <phrase>Part of a taxonomy for basic annotation types.</phrase>
- </textobject>
- </mediaobject>
- </screenshot>
-
- By using (and extending) the taxonomy, the knowledge engineer is
- able
- to choose the most adequate types and concepts when defining new
- matching rules, i.e., TextMarker rules for matching a text fragment
- given by a set of symbols to an annotation. If the capitalization of
- a word, for example, is of no importance, then the annotation type W
- that describes words of any kind can be used. The initial scanner
- creates a set of basic annotations that may be used by the matching
- rules of the TextMarker language. However, most information
- extraction applications require domain specific concepts and
- annotations. Therefore, the knowledge engineer is able to extend the
- set of annotations, and to define new annotation types tuned to the
- requirements of the given domain. These types can be flexibly
- integrated in the taxonomy of annotation types.
-
- One of the goals in
- developing a new information extraction language
- was
- to maintain an
- easily readable syntax while still providing a
- scalable
- expressiveness
- of the language. Basically, the TextMarker
- language
- contains
- expressions for the definition of new annotation
- types and
- for defining
- new matching rules. The rules are defined by a
- list of
- rule elements.
- Each rule element contains at least a basic matching
- condition
- referring
- to text fragments or already specified
- annotations.
- Additionally a
- list of conditions and actions may be
- specified for a
- rule element.
- Whereas the conditions describe
- necessary attributes of
- the matched
- text fragment, the actions point
- to operations and
- assignments on
- the
- current fragments. These actions
- will then only be
- executed if all
- basic conditions matched on a text
- fragment or the
- annotation and the
- related conditions are fulfilled.
- </para>
- </section>
- <section id="ugr.tools.tm.introduction.examples">
- <title>Examples</title>
- <para>
- The usage of the language and its readability can be demonstrated by
- simple examples:
-
- <programlisting><![CDATA[
- CW{INLIST('animals.txt') -> MARK(Animal)};
- Animal "and" Animal{-> MARK(Animalpair, 1, 2, 3)};
- ]]></programlisting>
-
- The first rule looks at all capitalized words that are listed in an
- external document animals.txt and creates a new annotation of the
- type
- animal using the boundaries of the matched word. The second rule
- searches for an annotation of the type animal followed by the
- literal
- and and a second animal annotation. Then it will create a new
- annotation animalpair covering the text segment that matched the
- three
- rule elements (the digit parameters refer to the number of
- matched
- rule element).
-
- <programlisting><![CDATA[
- Document{-> MARKFAST(Firstname, 'firstnames.txt')};
- Firstname CW{-> MARK(Lastname)};
- Paragraph{VOTE(Firstname, Lastname) -> LOG("Found more Firstnames than Lastnames")};
- ]]></programlisting>
-
- In this example, the first rule annotates all words that occur in
- the
- external document firstnames.txt with the type firstname. The
- second
- rule creates a lastname annotation for all capitalized word
- that
- follow a firstname annotation. The last rule finally processes
- all
- paragraph} annotations. If the VOTE condition counts more
- firstname
- than lastname annotations, then the rule writes a log entry
- with a
- predefined message.
-
-
- <programlisting><![CDATA[
- ANY+{PARTOF(Paragraph), CONTAINS(Delete, 50, 100, true) -> MARK(Delete)};
- Firstname{-> MARK(Delete,1 , 2)} Lastname;
- Delete{-> DEL};
- ]]></programlisting>
-
- Here, the first rule looks for sequences of any kind of tokens
- except
- markup and creates one annotation of the type delete for each
- sequence, if the tokens are part of a paragraph annotation and
- contains together already more than 50% of delete annoations. The +
- signs indicate this greedy processing. The second rule annotates
- first
- names followed by last names with the type delete and the third
- rule
- simply deletes all text segments that are associated with that
- delete
- annotation.
-
- </para>
- </section>
- <section id="ugr.tools.tm.introduction.features">
- <title>Special Features</title>
- <para>
- The TextMarker language features some special characteristics
- that are
- usually not found in other rule-based information extraction
- systems
- or even shift it towards scripting languages. The possibility
- of
- creating new annotation types and integrating them into the
- taxonomy
- facilitates an even more modular development of information
- extraction systems.
-
- Read more about robust extraction using
- filtering,
- complex control
- structures and heuristic extraction using
- scoring
- rules.
- </para>
- </section>
- <section id="ugr.tools.tm.introduction.getstarted">
- <title>Get started</title>
- <para>
- This section page gives you a short, technical introduction on
- how to
- get
- started with TextMarker system and mostly just links the
- information
- of the other wiki pages. Some knowledge about the usage
- of
- Eclipse and
- central concepts of UIMA are useful. TextMarker
- consists of
- the
- TextMarker rule language (and of course the rule
- inference) and the
- TextMarker workbench. Additionally, the CEV plugin
- is used to edit
- and
- visualize annotated text. The TextRuler system
- with implementations of
- well known rule learning methods and
- development extension with
- support for test-driven development are
- already integrated.
- </para>
- <section id="ugr.tools.tm.introduction.getstarted.running">
- <title>Up and running</title>
- <para>
- First of all, install the Workbench and read the introduction
- and its
- examples. In order to verify if the Workbench is correctly
- installed,
- take a look at Help-About Eclipse-Installation Details
- and
- compare
- the installed plugins with the plugins you copied into
- the
- plugins
- folder of your Eclipse application. Normally most of the
- plugins do
- not cause any troubles, but the CEV does because of the
- XPCom and
- XULRunner dependencies. You should at least get the XPCom
- plugin up
- and running. However, you cannot use the additional HTML
- functionality without the XULRunner plugin. If the plugins of the
- installation guide do not work properly and a google search for a
- suiteable plugin is not successful, then write a mail to the user
- list and we will try to solve the problem. If all plugins are
- correctly installed, then start the Eclipse application and switch
- to
- the TextMarker perspective (Window-Open Perspective-Other...)
- </para>
- </section>
- <section id="ugr.tools.tm.introduction.getstarted.example">
- <title>Learn by example</title>
- <para>
- Having a running Workbench download the example project and
- import/copy
- this TextMarker project into your workspace. The project
- contains
- some simple rules for extraction the author, title and year
- of
- reference strings. Next, take a look at the project structure and
- the
- syntax and compare it with the example project and its contents.
- Open
- the Main.tm TextMarker script in the folder
- script/de.uniwue.example
- and press the Run button in the Eclipse
- toolbar. The docments in
- the
- input folder will then be processed by
- the Main.tm file and the
- result of the information extraction task
- is
- placed in the output
- folder. As you can see, there are four
- files: an
- xmiCAS for each
- input file and a HTML file (the
- modifed/colored
- result). Open one of
- the .xmi files with the CAS
- Editor plugin (-popup
- menu-Open with) and
- select some checkboxes in
- the Annotation Browser
- view.
- </para>
- </section>
- <section id="ugr.tools.tm.introduction.getstarted.doit">
- <title>Do it yourself</title>
- <para>
- Try to write some rules yourself. Read the description of the
- available
- language constructs, e.g., conditions and actions and use
- the
- explanation component in order to take a closer look at the rule
- inference. Then finally, read the rest of this document.
- </para>
- </section>
- </section>
- <section id="ugr.tools.tm.ae">
- <title>TextMarker Analysis Engine</title>
- <para>
- - TextMarker in UIMA, only a AE which is parameterized and
- controlled
- by that.
- </para>
- <section id="ugr.tools.tm.ae.parameter">
- <title>Configuration Parameters</title>
- <para>
- The configuration parameters of the TextMarker analysis engines can
- be separated into three different groups: parameters for the setup
- of
- the environment (
- <xref linkend='ugr.tools.tm.ae.parameter.mainScript' />
- to
- <xref linkend='ugr.tools.tm.ae.parameter.additionalExtensions' />
- ), parameters that change the behavior of the analysis engine (
- <xref linkend='ugr.tools.tm.ae.parameter.reloadScript' />
- to
- <xref linkend='ugr.tools.tm.ae.parameter.simpleGreedyForComposed' />
- ) and parameters for creating additional information how the rules
- were executed (
- <xref linkend='ugr.tools.tm.ae.parameter.debug' />
- to
- <xref linkend='ugr.tools.tm.ae.parameter.createdBy' />
- ). First, a short overview of the configuration parameters is given
- in
- <ref linkend='table.ugr.tools.tm.ae.parameter' />
- . Then all parameters are described in detail with examples.
-
- <table id="table.ugr.tools.tm.ae.parameter" frame="all">
- <title>Configuration parameters of the TextMarker Analysis Engine
- </title>
- <tgroup cols="3" colsep="1" rowsep="1">
- <colspec colname="c1" colwidth="1.2*" />
- <colspec colname="c2" colwidth="2*" />
- <colspec colname="c3" colwidth="0.8*" />
- <thead>
- <row>
- <entry align="center">Name</entry>
- <entry align="center">Short description</entry>
- <entry align="center">Type</entry>
- </row>
- </thead>
- <tbody>
- <row>
- <entry>
- mainScript
- <ref linkend='ugr.tools.tm.ae.parameter.mainScript' />
- </entry>
- <entry>
- Name with complete namespace of the script which will be
- interpreted and executed by the analysis engine.
- </entry>
- <entry>
- Single String
- </entry>
- </row>
- <row>
- <entry>scriptEncoding</entry>
- <entry>
- Encoding of all TextMarker script files.
- </entry>
- <entry>
- Single String
- </entry>
- </row>
- <row>
- <entry>scriptPaths</entry>
- <entry>
- List of absolute locations, which contain the neccessary
- script files like the main script.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>descriptorPaths</entry>
- <entry>
- List of absolute locations, which contain the neccessary
- descriptor files like type systems.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>resourcePaths</entry>
- <entry>
- List of absolute locations, which contain the neccessary
- resource files like word lists.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>additionalScripts</entry>
- <entry>
- List of names with complete namespace of additional
- scripts, which can be referred to.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>additionalEngines</entry>
- <entry>
- List of names with complete namespace of additional
- analysis engines, which can be called by TextMarker rules.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>additionalEngineLoaders</entry>
- <entry>
- List of class names of implementations that are able to
- perform additional task when loading external analysis engines.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>additionalExtensions</entry>
- <entry>
- List of factory classes for additional extensions of the
- TextMarker language like proprietary conditions.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
-
- <row>
- <entry>reloadScript</entry>
- <entry>
- Option to initialize the rule script each time the
- analysis engine
- processes a CAS.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
- <row>
- <entry>seeders</entry>
- <entry>
- List of class names that provide additional annoations
- before the rules are executed.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>defaultFilteredTypes</entry>
- <entry>
- List of complete type names of annoations that are
- invisible by default.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>removeBasics</entry>
- <entry>
- Option to remove all inference annoations after execution
- of the rule script.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
- <row>
- <entry>dynamicAnchoring</entry>
- <entry>
- Option to allow rule matches to start at any rule
- element.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
- <row>
- <entry>lowMemoryProfile</entry>
- <entry>
- Option to decrease the memory consumption when processing
- a large CAS.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
- <row>
- <entry>simpleGreedyForComposed</entry>
- <entry>
- Option to activate a different inferencer for composed
- rule elements.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
-
-
- <row>
- <entry>debug</entry>
- <entry>
- Option to add debug information to the CAS.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
- <row>
- <entry>debugWithMatches</entry>
- <entry>
- Option to add information about the rule matches to the
- CAS.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
- <row>
- <entry>debugOnlyFor</entry>
- <entry>
- List of rule ids. If provided, then debug information is
- only created for those rules.
- </entry>
- <entry>
- Multi String
- </entry>
- </row>
- <row>
- <entry>profile</entry>
- <entry>
- Option to add profile information to the CAS.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
- <row>
- <entry>statistics</entry>
- <entry>
- Option to add statistics of conditions and actions to the
- CAS.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
- <row>
- <entry>createdBy</entry>
- <entry>
- Option to add additional information, which rule created
- a annotation.
- </entry>
- <entry>
- Single Boolean
- </entry>
- </row>
-
-
-
- </tbody>
- </tgroup>
- </table>
- </para>
- <section id="ugr.tools.tm.ae.parameter.mainScript">
- <title>mainScript</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.scriptEncoding">
- <title>scriptEncoding</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.scriptPaths">
- <title>scriptPaths</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.descriptorPaths">
- <title>descriptorPaths</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.resourcePaths">
- <title>resourcePaths</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.additionalScripts">
- <title>additionalScripts</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.additionalEngines">
- <title>additionalEngines</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.additionalEngineLoaders">
- <title>additionalEngineLoaders</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.additionalExtensions">
- <title>additionalExtensions</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.reloadScript">
- <title>reloadScript</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.seeders">
- <title>seeders</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.defaultFilteredTypes">
- <title>defaultFilteredTypes</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.removeBasics">
- <title>removeBasics</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.dynamicAnchoring">
- <title>dynamicAnchoring</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.lowMemoryProfile">
- <title>lowMemoryProfile</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.simpleGreedyForComposed">
- <title>simpleGreedyForComposed</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.debug">
- <title>debug</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.debugWithMatches">
- <title>debugWithMatches</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.debugOnlyFor">
- <title>debugOnlyFor</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.profile">
- <title>profile</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.statistics">
- <title>statistics</title>
- <para>
- </para>
- </section>
- <section id="ugr.tools.tm.ae.parameter.createdBy">
- <title>createdBy</title>
- <para>
- </para>
- </section>
-
- </section>
- </section>
+ <title>TextMarker</title>
+ <para>The TextMarker system is an open source tool for the development of rule-based information
+ extraction applications. The development environment is based on the DLTK framework. It supports
+ the knowledge engineer with a full-featured rule editor, components for the explanation of the
+ rule inference and a build process for generic UIMA Analysis Engines and Type Systems. Therefore
+ TextMarker components can be easily created and flexibly combined with other UIMA components in
+ different information extraction pipelines. TextMarker applies a specialized rule
+ representation language for effective knowledge formalization: The rules of the TextMarker
+ language are composed of a list of rule elements that themselves consist of four parts: The
+ mandatory matching condition establishes a connection to the input document by referring to an
+ already existing concept or annotation. The optional quantifier defines the usage of
+ the matching condition similar to regular expressions. Then, additional conditions add
+ constraints to the matched text fragment and additional actions determine the consequences of
+ the rule. Therefore, TextMarker rules match on a pattern of given annotations and, if the
+ additional conditions evaluate to true, then they execute their actions, e.g., create a new
+ annotation. If no initial annotations exist (for example, ones created by another component), a
+ scanner is used to seed simple token annotations contained in a taxonomy. The TextMarker system
+ provides unique functionality that is usually not found in similar systems. The actions are able
+ to modify the document either by replacing or deleting text fragments or by filtering the view
+ on the document. In this case, the rules ignore some annotations, e.g. HTML markup, or are
+ executed only on the remaining text passages. The knowledge engineer is able to add heuristic
+ knowledge by using scoring rules. Additionally, several language elements common to scripting
+ languages, like conditional statements, loops, procedures, recursion, variables and expressions,
+ increase the expressiveness of the language. Rules are able to directly invoke external rule
+ sets or arbitrary UIMA Analysis Engines, and foreign libraries can be integrated with the
+ extension mechanism for new language elements.
+
+ </para>
+ <section id="ugr.tools.tm.introduction.metaphor">
+ <title>Introduction</title>
+ <para> In manual information extraction, humans often apply a strategy according to a highlighter
+ metaphor: First, relevant headlines are considered and classified according to their content by
+ coloring them with different highlighters. The paragraphs of the annotated headlines are then
+ considered further. Relevant text fragments or single words in the context of that headline
+ can then be colored. In this way, a top-down analysis and extraction strategy is implemented.
+ Necessary additional information can then be added that either refers to other text segments
+ or contains valuable domain-specific information. Finally, the colored text can be easily
+ analyzed concerning the relevant information. The TextMarker system (textmarker is a common
+ German word for a highlighter) tries to imitate this manual extraction method by formalizing
+ the appropriate actions using matching rules: The rules mark sequences of words, extract text
+ segments or modify the input document depending on textual features. The default input for the
+ TextMarker system is semi-structured text, but it can also process structured or free text.
+ Technically, HTML is often the input format, since most word processing documents can be
+ converted to HTML. Additionally, the TextMarker system offers the possibility to create a
+ modified output document.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.concepts">
+ <title>Core Concepts</title>
+ <para>
+ As a first step in the extraction process the TextMarker system uses a tokenizer (scanner) to
+ tokenize the input document and to create a stream of basic symbols. The types and valid
+ annotations of the possible tokens are predefined by a taxonomy of annotation types.
+ Annotations simply refer to a section of the input document and assign a type or concept to
+ the respective text fragment. The figure on the right shows an excerpt of a basic annotation
+ taxonomy: CW describes all tokens, for example, that contain a single word starting with a
+ capital letter, MARKUP corresponds to HTML or XML tags, and PM refers to all kinds of
+ punctuation marks. Take a look at [basic annotations|BasicAnnotationList] for a complete list
+ of initial annotations.
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="80" format="PNG" fileref="&imgroot;symboltaxo.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Part of a taxonomy for basic annotation types.</phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+ By using (and extending) the taxonomy, the knowledge engineer is able to choose the most
+ adequate types and concepts when defining new matching rules, i.e., TextMarker rules for
+ matching a text fragment given by a set of symbols to an annotation. If the capitalization of
+ a word, for example, is of no importance, then the annotation type W that describes words of
+ any kind can be used. The initial scanner creates a set of basic annotations that may be used
+ by the matching rules of the TextMarker language. However, most information extraction
+ applications require domain specific concepts and annotations. Therefore, the knowledge
+ engineer is able to extend the set of annotations, and to define new annotation types tuned to
+ the requirements of the given domain. These types can be flexibly integrated in the taxonomy
+ of annotation types. One of the goals in developing a new information extraction language was
+ to maintain an easily readable syntax while still providing a scalable expressiveness of the
+ language. Basically, the TextMarker language contains expressions for the definition of new
+ annotation types and for defining new matching rules. The rules are defined by a list of rule
+ elements. Each rule element contains at least a basic matching condition referring to text
+ fragments or already specified annotations. Additionally, a list of conditions and actions may
+ be specified for a rule element. Whereas the conditions describe necessary attributes of the
+ matched text fragment, the actions point to operations and assignments on the current
+ fragments. These actions will then only be executed if all basic conditions matched on a text
+ fragment or the annotation and the related conditions are fulfilled.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.examples">
+ <title>Examples</title>
+ <para>
+ The usage of the language and its readability can be demonstrated by simple examples:
+ <programlisting><![CDATA[
+ CW{INLIST('animals.txt') -> MARK(Animal)};
+ Animal "and" Animal{-> MARK(Animalpair, 1, 2, 3)};
+ ]]></programlisting>
+ The first rule looks at all capitalized words that are listed in an external document
+ animals.txt and creates a new annotation of the type Animal using the boundaries of the
+ matched word. The second rule searches for an annotation of the type Animal followed by the
+ literal "and" and a second Animal annotation. Then it will create a new annotation Animalpair
+ covering the text segment that matched the three rule elements (the numeric parameters refer to
+ the positions of the matched rule elements).
+ <programlisting><![CDATA[Document{-> MARKFAST(Firstname, 'firstnames.txt')};
+ Firstname CW{-> MARK(Lastname)};
+ Paragraph{VOTE(Firstname, Lastname) -> LOG("Found more Firstnames than Lastnames")};
+ ]]></programlisting>
+ In this example, the first rule annotates all words that occur in the external document
+ firstnames.txt with the type Firstname. The second rule creates a Lastname annotation for all
+ capitalized words that follow a Firstname annotation. The last rule finally processes all
+ Paragraph annotations. If the VOTE condition counts more Firstname than Lastname annotations,
+ then the rule writes a log entry with a predefined message.
+ <programlisting><![CDATA[ANY+{PARTOF(Paragraph), CONTAINS(Delete, 50, 100, true) -> MARK(Delete)};
+ Firstname{-> MARK(Delete, 1, 2)} Lastname;
+ Delete{-> DEL};
+ ]]></programlisting>
+ Here, the first rule looks for sequences of any kind of tokens except markup and creates one
+ annotation of the type Delete for each sequence, if the tokens are part of a Paragraph
+ annotation and together already contain more than 50% Delete annotations. The + sign
+ indicates the greedy matching of such sequences. The second rule annotates first names followed
+ by last names with the type Delete and the third rule simply deletes all text segments that
+ are associated with a Delete annotation.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.features">
+ <title>Special Features</title>
+ <para> The TextMarker language has some special characteristics that are usually not found
+ in other rule-based information extraction systems and that even shift it towards scripting
+ languages. The possibility of creating new annotation types and integrating them into the
+ taxonomy facilitates an even more modular development of information extraction systems.
+ Other special features are robust extraction using filtering, complex control structures and
+ heuristic extraction using scoring rules.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.getstarted">
+ <title>Get started</title>
+ <para> This section gives you a short, technical introduction on how to get started with
+ the TextMarker system and mostly links to information given in other parts of the
+ documentation. Some knowledge about the usage of Eclipse and the central concepts of UIMA is
+ useful. TextMarker consists of the TextMarker rule language (and of course the rule inference)
+ and the TextMarker workbench. Additionally, the CEV plugin is used to edit and visualize
+ annotated text. The TextRuler system with implementations of well-known rule learning methods
+ and a development extension with support for test-driven development are already integrated.
+ </para>
+ <section id="ugr.tools.tm.introduction.getstarted.running">
+ <title>Up and running</title>
+ <para> First of all, install the Workbench and read the introduction and its examples. In
+ order to verify that the Workbench is correctly installed, take a look at Help-About
+ Eclipse-Installation Details and compare the installed plugins with the plugins you copied
+ into the plugins folder of your Eclipse application. Normally, most of the plugins do not
+ cause any trouble, but the CEV may because of its XPCom and XULRunner dependencies. You
+ should at least get the XPCom plugin up and running. However, you cannot use the additional
+ HTML functionality without the XULRunner plugin. If the plugins of the installation guide do
+ not work properly and a web search for a suitable plugin is not successful, then write a
+ mail to the user list and we will try to solve the problem. If all plugins are correctly
+ installed, then start the Eclipse application and switch to the TextMarker perspective
+ (Window-Open Perspective-Other...).
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.getstarted.example">
+ <title>Learn by example</title>
+ <para> Once the Workbench is running, download the example project and import/copy this
+ TextMarker project into your workspace. The project contains some simple rules for extracting
+ the author, title and year of reference strings. Next, take a look at the project structure
+ and the syntax and compare it with the example project and its contents. Open the Main.tm
+ TextMarker script in the folder script/de.uniwue.example and press the Run button in the
+ Eclipse toolbar. The documents in the input folder will then be processed by the Main.tm file
+ and the result of the information extraction task is placed in the output folder. As you can
+ see, there are four files: an xmiCAS for each input file and an HTML file (the
+ modified/colored result). Open one of the .xmi files with the CAS Editor plugin (-popup
+ menu-Open with) and select some checkboxes in the Annotation Browser view.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.getstarted.doit">
+ <title>Do it yourself</title>
+ <para> Try to write some rules yourself. Read the description of the available language
+ constructs, e.g., conditions and actions, and use the explanation component in order to take
+ a closer look at the rule inference. Finally, read the rest of this document.
+ </para>
+ </section>
+ </section>
+ <section id="ugr.tools.tm.ae">
+ <title>TextMarker Analysis Engine</title>
+ <para> Description of TextMarker and other Analysis Engines</para>
+ <section id="ugr.tools.tm.ae.parameter">
+ <title>Configuration Parameters</title>
+ <para>
+ The configuration parameters of the TextMarker analysis engines can be separated into three
+ different groups: parameters for the setup of the environment (
+ <link linkend='ugr.tools.tm.ae.parameter.mainScript'> mainScript</link>
+ to
+ <link linkend='ugr.tools.tm.ae.parameter.additionalExtensions'> additionalExtensions</link>
+ ), parameters that change the behavior of the analysis engine (
+ <link linkend='ugr.tools.tm.ae.parameter.reloadScript'> reloadScript</link>
+ to
+ <link linkend='ugr.tools.tm.ae.parameter.simpleGreedyForComposed'> simpleGreedyForComposed</link>
+ ) and parameters for creating additional information about how the rules were executed (
+ <link linkend='ugr.tools.tm.ae.parameter.debug'> debug</link>
+ to
+ <link linkend='ugr.tools.tm.ae.parameter.createdBy'> createdBy</link>
+ ). First, a short overview of the configuration parameters is given in
+ <xref linkend='table.ugr.tools.tm.ae.parameter' />
+ . Then all parameters are described in detail with examples.
+ <table id="table.ugr.tools.tm.ae.parameter" frame="all">
+ <title>Configuration parameters of the TextMarker Analysis Engine </title>
+ <tgroup cols="3" colsep="1" rowsep="1">
+ <colspec colname="c1" colwidth="1.2*" />
+ <colspec colname="c2" colwidth="2*" />
+ <colspec colname="c3" colwidth="0.8*" />
+ <thead>
+ <row>
+ <entry align="center">Name</entry>
+ <entry align="center">Short description</entry>
+ <entry align="center">Type</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.mainScript'>mainScript</link>
+ </entry>
+ <entry>Name with complete namespace of the script which will be interpreted and
+ executed by the analysis engine.
+ </entry>
+ <entry>Single String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.scriptEncoding'>scriptEncoding</link>
+ </entry>
+ <entry>Encoding of all TextMarker script files.</entry>
+ <entry>Single String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.scriptPaths'>scriptPaths</link>
+ </entry>
+ <entry>List of absolute locations, which contain the necessary script files like
+ the main script.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.descriptorPaths'>descriptorPaths</link>
+ </entry>
+ <entry>List of absolute locations, which contain the necessary descriptor files
+ like type systems.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.resourcePaths'>resourcePaths</link>
+ </entry>
+ <entry>List of absolute locations, which contain the necessary resource files like
+ word lists.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.additionalScripts'>additionalScripts</link>
+ </entry>
+ <entry>List of names with complete namespace of additional scripts, which can be
+ referred to.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.additionalEngines'>additionalEngines</link>
+ </entry>
+ <entry>List of names with complete namespace of additional analysis engines, which
+ can be called by TextMarker rules.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.additionalEngineLoaders'>additionalEngineLoaders</link>
+ </entry>
+ <entry>List of class names of implementations that are able to perform additional
+ tasks when loading external analysis engines.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.additionalExtensions'>additionalExtensions</link>
+ </entry>
+ <entry>List of factory classes for additional extensions of the TextMarker language
+ like proprietary conditions.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.reloadScript'>reloadScript</link>
+ </entry>
+ <entry>Option to initialize the rule script each time the analysis engine processes
+ a CAS.
+ </entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.seeders'>seeders</link>
+ </entry>
+ <entry>List of class names that provide additional annotations before the rules are
+ executed.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.defaultFilteredTypes'>defaultFilteredTypes</link>
+ </entry>
+ <entry>List of complete type names of annotations that are invisible by default.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.removeBasics'>removeBasics</link>
+ </entry>
+ <entry>Option to remove all inference annotations after execution of the rule script.
+ </entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.dynamicAnchoring'>dynamicAnchoring</link>
+ </entry>
+ <entry>Option to allow rule matches to start at any rule element.</entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.lowMemoryProfile'>lowMemoryProfile</link>
+ </entry>
+ <entry>Option to decrease the memory consumption when processing a large CAS.
+ </entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.simpleGreedyForComposed'>simpleGreedyForComposed</link>
+ </entry>
+ <entry>Option to activate a different inferencer for composed rule elements.</entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.debug'>debug</link>
+ </entry>
+ <entry>Option to add debug information to the CAS.</entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.debugWithMatches'>debugWithMatches</link>
+ </entry>
+ <entry>Option to add information about the rule matches to the CAS.</entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.debugOnlyFor'>debugOnlyFor</link>
+ </entry>
+ <entry>List of rule ids. If provided, then debug information is only created for
+ those rules.
+ </entry>
+ <entry>Multi String</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.profile'>profile</link>
+ </entry>
+ <entry>Option to add profile information to the CAS.</entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.statistics'>statistics</link>
+ </entry>
+ <entry>Option to add statistics of conditions and actions to the CAS.</entry>
+ <entry>Single Boolean</entry>
+ </row>
+ <row>
+ <entry>
+ <link linkend='ugr.tools.tm.ae.parameter.createdBy'>createdBy</link>
+ </entry>
+ <entry>Option to add additional information about which rule created an annotation.
+ </entry>
+ <entry>Single Boolean</entry>
+ </row>
+
+ </tbody>
+ </tgroup>
+ </table>
+ </para>
+ <section id="ugr.tools.tm.ae.parameter.mainScript">
+ <title>mainScript</title>
+ <para>
+ This parameter specifies the rule file that will be executed by the analysis engine and is
+ therefore one of the most important ones. The exact name of the script is given by the complete namespace of the file, which corresponds to its location
+ relative to the given parameter <link linkend='ugr.tools.tm.ae.parameter.scriptPaths'>scriptPaths</link>.
+ The single names of packages (or folders) are separated by periods. An exemplary value for this parameter could be "org.apache.uima.Main",
+ where "Main" specifies the file containing the rules and "org.apache.uima" its package.
+ In this case, the analysis engine loads the script file "Main.tm", which is located in the folder structure "org/apache/uima/".
+ This parameter has no default value and has to be provided, although it is not specified as mandatory.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.scriptEncoding">
+ <title>scriptEncoding</title>
+ <para>
+ This parameter specifies the encoding of the rule files. Its default value is "UTF-8".
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.scriptPaths">
+ <title>scriptPaths</title>
+ <para>
+ The parameter scriptPaths refers to a list of String values, which specify the possible locations of script files.
+ The given locations are absolute paths. A typical value for this parameter is for example "C:/TextMarker/MyProject/script/".
+ If the parameter <link linkend='ugr.tools.tm.ae.parameter.mainScript'>mainScript</link> is set to org.apache.uima.Main,
+ then the absolute path of the script file has to be "C:/TextMarker/MyProject/script/org/apache/uima/Main.tm".
+ This parameter can contain multiple values, as the main script can refer to multiple projects similar to a class path in Java.
+ </para>
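+ <para>
+ As a minimal sketch, the two parameters mainScript and scriptPaths could be set in the
+ configurationParameterSettings element of the analysis engine descriptor as follows. The values are
+ the illustrative ones used above, and the rest of the descriptor is assumed and not shown.
+ <programlisting><![CDATA[<configurationParameterSettings>
+   <nameValuePair>
+     <name>mainScript</name>
+     <value>
+       <string>org.apache.uima.Main</string>
+     </value>
+   </nameValuePair>
+   <nameValuePair>
+     <name>scriptPaths</name>
+     <value>
+       <array>
+         <!-- illustrative location, resolves to .../script/org/apache/uima/Main.tm -->
+         <string>C:/TextMarker/MyProject/script/</string>
+       </array>
+     </value>
+   </nameValuePair>
+ </configurationParameterSettings>
+ ]]></programlisting>
+ </para>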
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.descriptorPaths">
+ <title>descriptorPaths</title>
+ <para>
+ This parameter specifies the possible locations for descriptors like analysis engines or type systems, similar to the parameter
+ <link linkend='ugr.tools.tm.ae.parameter.scriptPaths'>scriptPaths</link> for the script files. A typical value for this parameter
+ is for example "C:/TextMarker/MyProject/descriptor/".
+ The relative values of the parameter <link linkend='ugr.tools.tm.ae.parameter.additionalEngines'>additionalEngines</link> are
+ resolved to these absolute locations.
+ This parameter can contain multiple values, as the main script can refer to multiple projects similar to a class path in Java.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.resourcePaths">
+ <title>resourcePaths</title>
+ <para>
+ This parameter specifies the possible locations of additional resources like word lists or CSV tables. The string values have to contain absolute
+ locations, for example, "C:/TextMarker/MyProject/resources/".
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.additionalScripts">
+ <title>additionalScripts</title>
+ <para>
+ The parameter additionalScripts is defined as a list of string values and contains script files,
+ which are additionally loaded by the analysis engine. These script files are specified by their
+ complete namespace, exactly like the value of the parameter <link linkend='ugr.tools.tm.ae.parameter.mainScript'>mainScript</link>
+ and can be referred to by language elements, e.g., by executing the contained rules. An exemplary
+ value of this parameter is "org.apache.uima.SecondaryScript". In this example, the main script could import
+ this script file by the declaration "SCRIPT org.apache.uima.SecondaryScript;" and then could execute it with the rule
+ "Document{-> CALL(SecondaryScript)};".
+ </para>
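+ <para>
+ Put together, a main script that uses the additional script from this example might look like the
+ following sketch. The PACKAGE declaration is an assumption about the location of the main script and
+ is only included for illustration.
+ <programlisting><![CDATA[PACKAGE org.apache.uima;
+
+ // import the additional script by its complete namespace
+ SCRIPT org.apache.uima.SecondaryScript;
+
+ // execute the rules of the imported script on the complete document
+ Document{-> CALL(SecondaryScript)};
+ ]]></programlisting>
+ </para>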
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.additionalEngines">
+ <title>additionalEngines</title>
+ <para>
+ This parameter contains a list of additional analysis engines, which can be executed by the TextMarker rules. The single values
+ are given by the name of the analysis engine with its complete namespace and have to be located relative to one value of the parameter
+ <link linkend='ugr.tools.tm.ae.parameter.descriptorPaths'>descriptorPaths</link>, the location where the analysis engine searches for the descriptor file.
+ An example for one value of the parameter is "utils.HtmlAnnotator", which points to the descriptor "HtmlAnnotator.xml" in the folder "utils".
+ </para>
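+ <para>
+ The interplay of descriptorPaths and additionalEngines can be sketched with the following descriptor
+ fragment. It assumes the illustrative folder used for descriptorPaths above, so that the engine is
+ expected at "C:/TextMarker/MyProject/descriptor/utils/HtmlAnnotator.xml".
+ <programlisting><![CDATA[<nameValuePair>
+   <name>descriptorPaths</name>
+   <value>
+     <array>
+       <string>C:/TextMarker/MyProject/descriptor/</string>
+     </array>
+   </value>
+ </nameValuePair>
+ <nameValuePair>
+   <name>additionalEngines</name>
+   <value>
+     <array>
+       <!-- resolved relative to one of the descriptorPaths values -->
+       <string>utils.HtmlAnnotator</string>
+     </array>
+   </value>
+ </nameValuePair>
+ ]]></programlisting>
+ </para>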
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.additionalEngineLoaders">
+ <title>additionalEngineLoaders</title>
+ <para>
+ The parameter "additionalEngineLoaders" specifies a list of optional implementations of the interface
+ "org.apache.uima.textmarker.extensions.IEngineLoader", which can be used for application-specific configuration of
+ additional analysis engines.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.additionalExtensions">
+ <title>additionalExtensions</title>
+ <para>
+ This parameter specifies optional extensions of the TextMarker language. The elements of the string list must implement the interface
+ "org.apache.uima.textmarker.extensions.ITextMarkerExtension". With those extensions, application-specific conditions and actions can be
+ added to the set of provided ones.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.reloadScript">
+ <title>reloadScript</title>
+ <para>
+ This boolean parameter indicates whether the script or resource files should be reloaded when processing a CAS. The default value is set to false.
+ In this case, the script files are loaded when the analysis engine is initialized. If script files or resource files are extended while a collection of documents is processed,
+ e.g., a dictionary is still being filled, then the parameter needs to be set to true in order to include the changes.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.seeders">
+ <title>seeders</title>
+ <para>
+ This list of string values refers to implementations of the interface "org.apache.uima.textmarker.seed.TextMarkerAnnotationSeeder",
+ which can be used to automatically add annotations to the CAS. The default value of the parameter is a single seeder, namely "org.apache.uima.textmarker.seed.DefaultSeeder",
+ which adds annotations for token classes like CW, MARKUP or SEMICOLON. Remember that additional annotations can also be added with
+ an additional engine that is executed by a TextMarker rule.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.defaultFilteredTypes">
+ <title>defaultFilteredTypes</title>
+ <para>
+ This parameter specifies a list of types, which are filtered by default when executing a script file. Using the default values of this parameter,
+ whitespaces, line breaks and markup elements are not visible to TextMarker rules. The visibility of annotations and therefore the covered text can be changed
+ using the actions <link linkend='ugr.tools.tm.language.actions.filtertype'>FILTERTYPE</link> and
+ <link linkend='ugr.tools.tm.language.actions.retaintype'>RETAINTYPE</link>.
+ </para>
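+ <para>
+ The effect of the default filtering can be illustrated with a small sketch. It assumes that the type SPACE
+ is among the types filtered by default and that RETAINTYPE without arguments resets the retained types again;
+ the rule in between is only an illustration.
+ <programlisting><![CDATA[// make whitespace visible to the following rules
+ Document{-> RETAINTYPE(SPACE)};
+
+ // the space between the two words now has to be matched explicitly
+ Firstname SPACE CW{-> MARK(Lastname)};
+
+ // return to the default visibility
+ Document{-> RETAINTYPE};
+ ]]></programlisting>
+ </para>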
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.removeBasics">
+ <title>removeBasics</title>
+ <para>
+ This parameter specifies whether the inference annotations created by the analysis engine should be removed after processing the CAS.
+ The default value is set to false.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.dynamicAnchoring">
+ <title>dynamicAnchoring</title>
+ <para>
+ If this parameter is set to true, then the TextMarker rules are not forced to start to match with the first rule element.
+ Instead, the rule element referring to the rarest type is chosen as the starting point. Therefore, this option can be utilized to optimize the performance.
+ Please note that the matching result can vary in some cases when greedy rule elements are applied.
+ The default value is set to false.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.lowMemoryProfile">
+ <title>lowMemoryProfile</title>
+ <para>
+ This parameter specifies whether the memory consumption should be reduced. This parameter should be set to true for
+ very large CAS documents (e.g., > 500k tokens), but it also reduces the performance. The default value is set to false.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.simpleGreedyForComposed">
+ <title>simpleGreedyForComposed</title>
+ <para>
+ This parameter specifies whether a different inference strategy for composed rule elements should be applied. This option is only necessary
+ if the composed rule element is expected to match very often, e.g., a rule element like (ANY ANY).
+ The default value of this parameter is set to false.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.debug">
+ <title>debug</title>
+ <para>
+ If this parameter is set to true, then additional information about the execution of a rule script is added to the CAS.
+ The actual information is specified by the following parameters.
+ The default value of this parameter is set to false.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.debugWithMatches">
+ <title>debugWithMatches</title>
+ <para>
+ This parameter specifies whether the match information (covered text) of the rules should be stored in the CAS.
+ The default value of this parameter is set to false.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.debugOnlyFor">
+ <title>debugOnlyFor</title>
+ <para>
+ This parameter specifies a list of rule ids that enumerate the rules for which debug information should be created.
+ No specific ids are given by default.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.profile">
+ <title>profile</title>
+ <para>
+ If this parameter is set to true, then additional information about the runtime of applied rules is added to the CAS.
+ The default value of this parameter is set to false.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.statistics">
+ <title>statistics</title>
+ <para>
+ If this parameter is set to true, then additional information about the runtime of TextMarker language elements like conditions and actions
+ is added to the CAS.
+ The default value of this parameter is set to false.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.ae.parameter.createdBy">
+ <title>createdBy</title>
+ <para>
+ If this parameter is set to true, then additional information is added to the CAS about what annotation was created by which rule.
+ The default value of this parameter is set to false.
+ </para>
+ </section>
+ </section>
+ </section>
</chapter>
\ No newline at end of file