Posted to commits@uima.apache.org by pk...@apache.org on 2012/07/20 14:27:15 UTC
svn commit: r1363750 [3/3] - in /uima/sandbox/trunk/TextMarker/uima-docbook-textmarker: ./ src/docbook/

Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml?rev=1363750&view=auto
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml (added)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml Fri Jul 20 12:27:14 2012
@@ -0,0 +1,1483 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+<!ENTITY imgroot "images/tools/tools.textmarker/" >
+<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >  
+%uimaents;
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor 
+	license agreements. See the NOTICE file distributed with this work for additional 
+	information regarding copyright ownership. The ASF licenses this file to 
+	you under the Apache License, Version 2.0 (the "License"); you may not use 
+	this file except in compliance with the License. You may obtain a copy of 
+	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required 
+	by applicable law or agreed to in writing, software distributed under the 
+	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS 
+	OF ANY KIND, either express or implied. See the License for the specific 
+	language governing permissions and limitations under the License. -->
+
+<chapter id="ugr.tools.tm.workbench">
+	<title>TextMarker Workbench</title>
+	<para>
+	</para>
+
+	<section id="ugr.tools.tm.install">
+		<title>Installation</title>
+		<para>
+			<orderedlist>
+				<listitem>
+					<para>Download, install and start Eclipse 3.5 or Eclipse 3.6.
+					</para>
+				</listitem>
+				<listitem>
+					<para>Add the Apache UIMA update site
+						(http://www.apache.org/dist/uima/eclipse-update-site/) and the
+						TextMarker update site
+						(http://ki.informatik.uni-wuerzburg.de/~pkluegl/updatesite/) to
+						the available software sites of your Eclipse installation. This
+						can be done in the "Install New Software" dialog in the Help
+						menu of Eclipse.
+					</para>
+				</listitem>
+				<listitem>
+					<para>Eclipse 3.6 only: TextMarker is currently based on DLTK 1.0.
+						Because the Eclipse 3.6 update site only supports DLTK 2.0, the
+						DLTK 1.0 update site
+						(http://download.eclipse.org/technology/dltk/updates-dev/1.0/)
+						has to be added as well.
+					</para>
+				</listitem>
+				<listitem>
+					<para>Select "Install New Software" in the Help menu of Eclipse,
+						if not done yet.
+					</para>
+				</listitem>
+				<listitem>
+					<para>Select the TextMarker update site at "Work with", deselect
+						"Group items by category" and select "Contact all update sites
+						during install to find required software".
+					</para>
+				</listitem>
+				<listitem>
+					<para>Select the TextMarker feature and continue the dialog. The
+						CEV feature is already contained in the TextMarker feature.
+						Eclipse will automatically install the Apache UIMA (version 2.3)
+						plugins and the DLTK Core Framework (version 1.X) plugins.
+					</para>
+				</listitem>
+				<listitem>
+					<para><emphasis>(Optional)</emphasis> If additional HTML
+						visualizations are desired, install the CEV HTML feature as
+						well. The XPCom and XULRunner features have to be installed
+						beforehand, for example by using an appropriate update site
+						(http://ftp.mozilla.org/pub/mozilla.org/xulrunner/eclipse/).
+						Please refer to the CEV installation instructions (CEVInstall)
+						for details.
+					</para>
+				</listitem>
+				<listitem>
+					<para>After the successful installation, switch to the TextMarker
+						perspective.
+					</para>
+				</listitem>
+			</orderedlist>
+
+			You can also download the TextMarker plugins from
+			<ulink url="https://sourceforge.net/projects/textmarker/">SourceForge.net</ulink>
+			and install the plugins mentioned above manually.
+		</para>
+	</section>
+	<section id="ugr.tools.tm.project">
+		<title>TextMarker Projects</title>
+		<para>
+			Similar to Java projects in Eclipse, the TextMarker workbench
+			provides the possibility to create TextMarker projects. TextMarker
+			projects require a certain folder structure that is created with the
+			project. The most important folders are the script folder that
+			contains the TextMarker rule files in a package and the descriptor
+			folder that contains the generated UIMA components. The input folder
+			contains the text files or xmiCAS files that are processed when a
+			TextMarker script is started. The results are placed in the output
+			folder.
+
+			<programlisting><![CDATA[
+  ||Project element|| Used for
+  | Project                   | the TextMarker project
+  | - script                  | source folder with TextMarker scripts
+  | -- my.package                 | the package, resulting in several folders 
+  | --- Script.tm                 | a TextMarker script
+  | - descriptor              | build folder for UIMA components
+  | -- my/package                 | the folder structure for the components
+  | --- ScriptEngine.xml          | the analysis engine of the Script.tm script
+  | --- ScriptTypeSystem.xml      | the type system of the Script.tm script
+  | -- BasicEngine.xml            | the analysis engine template for all generated engines in this project 
+  | -- BasicTypeSystem.xml        | the type system template for all generated type systems in this project
+  | -- InternalTypeSystem.xml     | a type system with TextMarker types
+  | -- Modifier.xml               | the analysis engine of the optional modifier that creates the ''modified'' view
+  | - input                   | folder that contains the files that will be processed when launching a TextMarker script
+  | -- test.html                  | an input file containing html
+  | -- test.xmi                   | an input file containing text and annotations
+  | - output                  | folder that contains the files that were processed by a TextMarker script
+  | -- test.html.modified.html    | the result of the modifier: replaced text and colored html
+  | -- test.html.xmi              | the result CAS with optional information
+  | -- test.xmi.modified.html     | the result of the modifier: replaced text and colored html
+  | -- test.xmi.xmi               | the result CAS with optional information
+  | - resources               | default folder for word lists and dictionaries
+  | -- Dictionary.mtwl            | a dictionary in the "multi tree word list" format
+  | -- FirstNames.txt             | a simple word list with first names:  one first name per line
+  | - test                    | test-driven development is still under construction
+]]></programlisting>
+
+		</para>
+
+	</section>
+	<section id="ugr.tools.tm.explain">
+		<title>Explanation</title>
+		<para>
+			Handcrafting rules is laborious, especially if the newly written
+			rules do not behave as expected. The TextMarker system is able to
+			log the application of every single rule and block in order to
+			provide an explanation of the rule inference and a minimal debug
+			functionality.
+
+			The explanation component is built upon the CEV plugin. The
+			information about the application of the rules is stored in the
+			result xmiCAS if the parameters of the executed engine are
+			configured correctly. The simplest way to generate this information
+			is to open a TextMarker file and click on the common "Debug" button
+			(it looks like a green bug) in Eclipse. The current TextMarker file
+			is then executed on the text files in the input directory, and
+			xmiCAS files are created in the output directory. They contain the
+			additional UIMA feature structures that describe the rule inference.
+			The resulting xmiCAS needs to be opened with the CEV plugin.
+			However, the debug information is only displayed by additional
+			views. In order to open the necessary views, you can either open
+			the "Explain" perspective or open the views separately and arrange
+			them as you like.
+
+			There are currently seven views that display information about the
+			execution of the rules: Applied Rules, Selected Rules, Rule List,
+			Matched Rules, Failed Rules, Rule Elements and Basic Stream.
+
+		</para>
+
+	</section>
+	<section id="ugr.tools.tm.dictionaries">
+		<title>Dictionaries</title>
+		<para>
+
+			The TextMarker system currently supports the usage of dictionaries
+			in four different ways. The files are always encoded in UTF-8. The
+			generated analysis engines provide a parameter "resourceLocation"
+			that specifies the folder that contains the external dictionary
+			files. The parameter is initially set to the resource folder of the
+			current TextMarker project. In order to use a different folder,
+			change the value of the parameter and rebuild all TextMarker rule
+			files in the project in order to update all analysis engines.
+
+			The entries of a dictionary are detected with the following
+			algorithm:
+
+			<programlisting><![CDATA[
+for all basic annotations of the matched annotation do
+  set current candidate to current basic
+  loop
+    if the dictionary contains current candidate then
+      remember candidate
+    else if an entry of the dictionary starts with the current candidate then
+      add next basic annotation to the current candidate
+      continue loop
+    else
+      stop loop
+]]></programlisting>
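+
+			The loop above can be sketched in Python. This is only an
+			illustration and not the TextMarker implementation: the function
+			name find_entries, the representation of the basic annotations as
+			a list of strings and the joining of basics with a single space
+			are assumptions. The pseudocode does not say what happens after a
+			hit, so the sketch assumes that a candidate is extended further as
+			long as a longer entry could still match.
+
+			<programlisting><![CDATA[
+def find_entries(tokens, dictionary):
+    """Return (start, end, text) for each dictionary entry found in tokens.
+
+    tokens     -- list of strings standing in for basic annotations (assumption)
+    dictionary -- set of entries; multi-word entries use single spaces
+    """
+    matches = []
+    for start in range(len(tokens)):
+        end = start + 1
+        candidate = tokens[start]
+        while True:
+            if candidate in dictionary:
+                # "remember candidate"
+                matches.append((start, end, candidate))
+            # extend only if a longer entry starts with the candidate
+            can_grow = any(
+                entry != candidate and entry.startswith(candidate)
+                for entry in dictionary
+            )
+            if can_grow and end < len(tokens):
+                candidate = candidate + " " + tokens[end]
+                end += 1
+            else:
+                break
+    return matches
+
+
+print(find_entries(["Peter", "lives", "in", "New", "York"],
+                   {"Peter", "New York"}))
+]]></programlisting>
+
+			The prefix test in this sketch scans the whole dictionary, which
+			corresponds to the simple word list case.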
+
+
+
+
+			Word List (.txt)
+			Word lists are simple text files that contain one term or string
+			per line. The strings may include white space and are separated by
+			line breaks.
+
+			Usage:
+			Content of a file named FirstNames.txt (located in the resource
+			folder of a TextMarker project):
+			<programlisting><![CDATA[
+Peter
+Jochen
+Joachim
+Martin
+]]></programlisting>
+
+			Exemplary rules:
+			<programlisting><![CDATA[
+LIST FirstNameList = 'FirstNames.txt';
+DECLARE FirstName;
+Document{-> MARKFAST(FirstName, FirstNameList)};
+]]></programlisting>
+
+			In this example, all first names in the given text file are
+			annotated in the input document with the type FirstName.
+
+			Tree Word List (.twl)
+			A tree word list is a compiled word list similar to a trie. A .twl
+			file is an XML file that contains a tree-like structure with a node
+			for each character. The nodes themselves refer to child nodes that
+			represent all characters that succeed the character of the parent
+			node. For single word entries, this results in a complexity of
+			O(m*log(n)) instead of O(m*n) for a simple .txt file, where m is
+			the number of basic annotations in the document and n is the number
+			of entries in the dictionary.
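+
+			The tree structure can be illustrated with a small Python sketch.
+			It is not the .twl file format or the TextMarker code: the names
+			TrieNode, build_trie and lookup are hypothetical, and the child
+			nodes are kept in a hash map for brevity. The point is that a
+			lookup visits at most one node per character of the candidate
+			instead of comparing the candidate with every entry.
+
+			<programlisting><![CDATA[
+class TrieNode:
+    def __init__(self):
+        self.children = {}     # character -> TrieNode
+        self.is_entry = False  # True if a word list entry ends here
+
+
+def build_trie(words):
+    """Build a character trie from an iterable of word list entries."""
+    root = TrieNode()
+    for word in words:
+        node = root
+        for ch in word:
+            node = node.children.setdefault(ch, TrieNode())
+        node.is_entry = True
+    return root
+
+
+def lookup(root, text):
+    """Return (is_entry, is_prefix_of_longer_entry) for text."""
+    node = root
+    for ch in text:
+        node = node.children.get(ch)
+        if node is None:
+            return (False, False)
+    return (node.is_entry, bool(node.children))
+
+
+root = build_trie(["Peter", "Jochen", "Joachim", "Martin"])
+print(lookup(root, "Peter"))  # a complete entry
+print(lookup(root, "Jo"))     # only the beginning of entries
+]]></programlisting>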
+
+			Usage:
+			A .twl file is generated using the popup menu. Select one or more
+			.txt files (or a folder containing .txt files), click the right
+			mouse button and choose "Convert to TWL". One or more .twl files
+			with the corresponding file names are then generated.
+
+			Exemplary rules:
+
+			<programlisting><![CDATA[
+LIST FirstNameList = 'FirstNames.twl';
+DECLARE FirstName;
+Document{-> MARKFAST(FirstName, FirstNameList)};
+]]></programlisting>
+
+			In this example, all first names in the given text file are again
+			annotated in the input document with the type FirstName.
+
+			Multi Tree Word List (.mtwl)
+			A multi tree word list is generated from multiple .txt files and
+			contains special nodes that provide additional information about
+			the original file of an entry. The .mtwl files are useful if
+			several different dictionaries are used in a TextMarker file. Five
+			dictionaries, for example, require five MARKFAST rules. The matched
+			text is therefore searched five times and the complexity is
+			5 * O(m*log(n)). Using a .mtwl file reduces the complexity to about
+			O(m*log(5*n)).
+
+			Usage:
+			A .mtwl file is generated using the popup menu. Select one or more
+			.txt files (or a folder containing .txt files), click the right
+			mouse button and choose "Convert to MTWL". A file named
+			"generated.mtwl" is then generated that contains the word lists of
+			all selected .txt files. Renaming the .mtwl file is recommended.
+
+			If, for example, two or more word lists named "FirstNames.txt",
+			"Companies.txt" and so on are given and the generated .mtwl file is
+			renamed to "Dictionary.mtwl", then the following rule annotates all
+			companies and first names in the complete document.
+
+			Exemplary rules:
+
+			<programlisting><![CDATA[
+LIST Dictionary = 'Dictionary.mtwl';
+DECLARE FirstName, Company;
+Document{-> TRIE("FirstNames.txt" = FirstName, "Companies.txt" = Company, Dictionary, false, 0, false, 0, "")};
+]]></programlisting>
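+
+			The idea of a single tree whose end nodes remember the original
+			word list can be sketched in Python. The names build_multi_trie
+			and sources_of, the nested dictionaries and the use of the empty
+			string as the key for the source information are assumptions made
+			for this illustration only. They do not describe the .mtwl format.
+
+			<programlisting><![CDATA[
+def build_multi_trie(word_lists):
+    """Merge several word lists into one trie.
+
+    word_lists maps a source file name to its list of entries.
+    """
+    root = {}
+    for source, words in word_lists.items():
+        for word in words:
+            node = root
+            for ch in word:
+                node = node.setdefault(ch, {})
+            # the key "" never collides with a one-character child key
+            node.setdefault("", set()).add(source)
+    return root
+
+
+def sources_of(root, text):
+    """Return the names of all word lists that contain text."""
+    node = root
+    for ch in text:
+        node = node.get(ch)
+        if node is None:
+            return set()
+    return node.get("", set())
+
+
+root = build_multi_trie({
+    "FirstNames.txt": ["Peter", "Martin"],
+    "Companies.txt": ["Martin", "Acme"],
+})
+# one pass over the tree answers the question for all word lists
+print(sorted(sources_of(root, "Martin")))
+]]></programlisting>
+
+			A mapping such as "FirstNames.txt" = FirstName in the TRIE action
+			can then be read as a table from source name to annotation type.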
+
+
+
+
+			Table (.csv)
+			The TextMarker system also supports tables stored in .csv files.
+
+			Usage:
+			Content of a file named TestTable.csv (located in the resource
+			folder of a TextMarker project):
+			<programlisting><![CDATA[
+Peter;P;
+Jochen;J;
+Joba;J;
+]]></programlisting>
+
+			Exemplary rules:
+			<programlisting><![CDATA[
+PACKAGE de.uniwue.tm;
+TABLE TestTable = 'TestTable.csv';
+DECLARE Annotation Struct (STRING first);
+Document{-> MARKTABLE(Struct, 1, TestTable, "first" = 2)};
+]]></programlisting>
+			In this example, the document is searched for all occurrences of the
+			entries of the first column of the given table. For each occurrence,
+			an annotation of the type Struct is created and its feature "first"
+			is filled with the entry of the second column.
+
+			For an input document with the content "Peter", the result is a
+			single annotation of the type Struct with "P" assigned to its
+			feature "first".
+
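+			The effect of the MARKTABLE rule can be mimicked with a short
+			Python sketch. It is a simplification and not the implementation
+			of the action: the names load_table and mark_table are
+			hypothetical, annotations are plain dictionaries, and the search
+			uses substring matching on the document text instead of basic
+			annotations.
+
+			<programlisting><![CDATA[
+def load_table(text, separator=";"):
+    """Split the content of a .csv file into rows of cells."""
+    return [line.split(separator) for line in text.splitlines() if line.strip()]
+
+
+def mark_table(document, rows, key_column, feature_columns):
+    """Create one annotation per occurrence of a key column entry.
+
+    key_column      -- 1-based index of the column that is searched for
+    feature_columns -- maps a feature name to a 1-based column index
+    """
+    annotations = []
+    for row in rows:
+        key = row[key_column - 1]
+        if not key:
+            continue  # skip empty cells
+        start = document.find(key)
+        while start != -1:
+            features = {name: row[col - 1]
+                        for name, col in feature_columns.items()}
+            annotations.append({"begin": start,
+                                "end": start + len(key),
+                                "features": features})
+            start = document.find(key, start + 1)
+    return annotations
+
+
+rows = load_table("Peter;P;\nJochen;J;\nJoba;J;\n")
+print(mark_table("Peter", rows, 1, {"first": 2}))
+]]></programlisting>
+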
+		</para>
+
+	</section>
+	<section id="ugr.tools.tm.parameters">
+		<title>Parameters</title>
+		<para>
+			<itemizedlist>
+				<listitem>
+					<para>mainScript (String): This is the TextMarker script that
+						will
+						be loaded and executed by the generated engine. The string
+						is
+						referencing the name of the file without file extension but
+						with
+						its complete namespace, e.g., my.package.Main.
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>scriptPaths (Multiple Strings): The given strings
+						specify the
+						folders that contain TextMarker script files, the
+						main script file
+						and the additional script files in particular.
+						Currently, there is
+						only one folder supported in the TextMarker
+						workbench (script).
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>enginePaths (Multiple Strings): The given strings
+						specify the
+						folders that contain additional analysis engines that
+						are called
+						from within a script file. Currently, there is only
+						one folder
+						supported in the TextMarker workbench (descriptor).
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>resourcePaths (Multiple Strings): The given strings
+						specify
+						the folders that contain the word lists and dictionaries.
+						Currently, there is only one folder supported in the TextMarker
+						workbench (resources).
+
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>additionalScripts (Multiple Strings): This parameter
+						contains a list of all known script files, referenced by their
+						complete namespace, e.g., my.package.AnotherOne.
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>additionalEngines (Multiple Strings): This parameter
+						contains a list of all known analysis engines.
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>additionalEngineLoaders (Multiple Strings): This
+						parameter
+						contains the class names of the implementations that
+						help to load
+						more complex analysis engines.
+
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>scriptEncoding (String): The encoding of the script
+						files.
+						Not yet supported, please use UTF-8.
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>defaultFilteredTypes (Multiple Strings): The complete
+						names
+						of the types that are filtered by default.
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>defaultFilteredMarkups (Multiple Strings): The names of
+						the
+						markups that are filtered by default.
+
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>seeders (Multiple Strings):
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>useBasics (String):
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>removeBasics (Boolean):
+
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>debug (Boolean):
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>profile (Boolean):
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>debugWithMatches (Boolean):
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>statistics (Boolean):
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>debugOnlyFor (Multiple Strings):
+
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>style (Boolean):
+					</para>
+				</listitem>
+
+				<listitem>
+					<para>styleMapLocation (String):
+					</para>
+				</listitem>
+			</itemizedlist>
+		</para>
+
+	</section>
+	<section id="ugr.tools.tm.query">
+		<title>Query</title>
+		<para>
+			The Query view can be used to run queries, written in the
+			TextMarker language, on several documents within a folder.
+
+			A short example of how to use the Query view:
+			<itemizedlist>
+				<listitem>
+					<para> In the first field "Query Data", the folder is added in
+						which the query is executed, for example with drag and drop from
+						the script explorer. If the checkbox is activated, then all
+						subfolders will be included in the query.
+					</para>
+				</listitem>
+				<listitem>
+					<para> The next field "Type System" must contain a type system
+						or a TextMarker script that specifies all types that are used in
+						the query.
+					</para>
+				</listitem>
+				<listitem>
+					<para> The query in the form of one or more TextMarker rules is
+						specified in the text field in the middle of the view. In the
+						example of the screenshot, all "Author" annotations are
+						selected that contain a "FalsePositive" or "FalseNegative"
+						annotation.
+					</para>
+				</listitem>
+				<listitem>
+					<para> If the start button near the tab of the view in the upper
+						right corner is pressed, then the results are displayed.
+					</para>
+				</listitem>
+			</itemizedlist>
+			<screenshot>
+				<mediaobject>
+					<imageobject>
+						<imagedata scale="80" format="PNG" fileref="&imgroot;Query.png" />
+					</imageobject>
+					<textobject>
+						<phrase>Query View</phrase>
+					</textobject>
+				</mediaobject>
+			</screenshot>
+
+		</para>
+	</section>
+	<section id="ugr.tools.tm.views">
+		<title>Views</title>
+		<para>
+
+		</para>
+		<section id="ugr.tools.tm.views.browser">
+			<title>Annotation Browser</title>
+			<para>
+			</para>
+		</section>
+		<section id="ugr.tools.tm.views.editor">
+			<title>Annotation Editor</title>
+			<para>
+			</para>
+		</section>
+		<section id="ugr.tools.tm.views.palette">
+			<title>Marker Palette</title>
+			<para>
+			</para>
+		</section>
+		<section id="ugr.tools.tm.views.selection">
+			<title>Selection</title>
+			<para>
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.stream">
+			<title>Basic Stream</title>
+			<para>
+				The basic stream contains a listing of the complete disjoint
+				partition of the document into TextMarkerBasic annotations, which
+				are used for the inference and the annotation seeding.
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.applied">
+			<title>Applied Rules</title>
+			<para>
+				The Applied Rules view displays how often a rule tried to apply
+				and how often the rule succeeded. Additionally, some profiling
+				information is added after a short verbalization of the rule. The
+				information is structured: if BLOCK constructs were used in the
+				executed TextMarker file, the rules contained in that block are
+				represented as child nodes in the tree of the view. Each
+				TextMarker file is itself a BLOCK construct named after the file.
+				Therefore, the root node of the view is always a BLOCK containing
+				the rules of the executed TextMarker script. Additionally, if a
+				rule calls a different TextMarker file, then the root block of
+				that file is the child of that rule. The selection of a rule in
+				this view directly changes the information visualized in the
+				other views.
+
+			</para>
+		</section>
+		<section id="ugr.tools.tm.views.selected">
+			<title>Selected Rules</title>
+			<para>
+				This view is very similar to the Applied Rules view, but displays
+				only rules and blocks under a given selection. If the user clicks
+				on the document, then an Applied Rules view is generated that
+				contains only the elements that affect that position in the
+				document. The Rule Elements view then only contains match
+				information of that position, but the result of the rule element
+				match is still displayed.
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.rulelist">
+			<title>Rule List</title>
+			<para>
+				This view is very similar to the Applied Rules view and the
+				Selected Rules view, but displays only rules and no blocks under a
+				given selection. If the user clicks on the document, then a list
+				of rules is generated that matched or tried to match on that
+				position in the document. The Rule Elements view then only
+				contains match information of that position, but the result of
+				the rule element match is still displayed. Additionally, this
+				view provides a text field for filtering the rules. Only those
+				rules remain that contain the entered text in their
+				verbalization.
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.matched">
+			<title>Matched Rules</title>
+			<para>
+				If a rule is selected in the Applied Rules view, then this view
+				displays the instances (text passages) where this rule matched.
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.failed">
+			<title>Failed Rules</title>
+			<para>
+				If a rule is selected in the Applied Rules view, then this view
+				displays the instances (text passages) where this rule failed to
+				match.
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.elements">
+			<title>Rule Elements</title>
+			<para>
+				If a successful or failed rule match is selected in the Matched
+				Rules view or the Failed Rules view, then this view contains a
+				listing of the rule elements and their conditions. Detailed
+				information is available on what text each rule element matched
+				and which conditions evaluated to true.
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.statistics">
+			<title>Statistics</title>
+			<para>
+				This view displays the used conditions and actions of the
+				TextMarker language. Three numbers are given for each element: the
+				total time of execution, the number of executions and the time
+				per execution.
+			</para>
+		</section>
+		<section id="ugr.tools.tm.views.fp">
+			<title>False Positive</title>
+			<para>
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.fn">
+			<title>False Negative</title>
+			<para>
+			</para>
+		</section>
+
+		<section id="ugr.tools.tm.views.tp">
+			<title>True Positive</title>
+			<para>
+
+			</para>
+		</section>
+	</section>
+	<section id="ugr.tools.tm.testing">
+		<title>Testing</title>
+		<para>
+			The TextMarker software comes bundled with its own testing
+			environment, which allows you to test and evaluate TextMarker
+			scripts. It provides full back end testing capabilities and allows
+			you to examine test results in detail. As a product of the testing
+			operation, a new document file is created, and detailed information
+			on how well the script performed in the test is added to this
+			document.
+		</para>
+		<section id="ugr.tools.tm.testing.overview">
+			<title>Overview</title>
+			<para>
+				The testing procedure compares a previously annotated gold standard
+				file with the result of the selected TextMarker script using an
+				evaluator. The evaluators compare the offsets of annotations in
+				both documents and, depending on the evaluator, mark a result
+				document with true positive, false positive or false negative
+				annotations. Afterwards, the F1 score is calculated for the whole
+				set of tests, each test file and each type in the test file.
+				The testing environment contains the following parts:
+				<itemizedlist>
+					<listitem>
+						<para>Main view</para>
+					</listitem>
+					<listitem>
+						<para>Result views : true positive, false positive, false
+							negative view
+						</para>
+					</listitem>
+					<listitem>
+						<para>Preference page</para>
+					</listitem>
+				</itemizedlist>
+				<screenshot>
+					<mediaobject>
+						<imageobject>
+							<imagedata scale="80" format="PNG"
+								fileref="&imgroot;Screenshot_main.png" />
+						</imageobject>
+						<textobject>
+							<phrase>Eclipse with open TextMarker and testing environment.
+							</phrase>
+						</textobject>
+					</mediaobject>
+				</screenshot>
+				All control elements that are needed for the interaction with the
+				testing environment are located in the main view. This is also
+				where test files can be selected and where information on how well
+				the script performed is displayed. During the testing process, a
+				result CAS file is produced that contains new annotation types
+				like true positives (tp), false positives (fp) and false negatives
+				(fn). While the result .xmi file is displayed in the script
+				editor, additional views allow easy navigation through the new
+				annotations. Additional tree views, like the true positive view,
+				display the corresponding annotations in a hierarchical structure.
+				This makes it easy to trace the results inside the testing
+				document. A preference page allows customization of the behavior
+				of the testing plug-in.
+			</para>
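+			<para>
+				How the tp, fp and fn counts lead to precision, recall and F1
+				score can be illustrated with a short Python sketch. It is not
+				the code of the testing plug-in: the function name
+				evaluate_exact and the representation of an annotation as a
+				(begin, end, type) tuple are assumptions, and comparing the
+				type in addition to the offsets is an assumption as well.
+				<programlisting><![CDATA[
+def evaluate_exact(gold, predicted):
+    """Compare two collections of (begin, end, type) tuples by exact match."""
+    gold_set, pred_set = set(gold), set(predicted)
+    tp = len(gold_set & pred_set)   # in both documents
+    fp = len(pred_set - gold_set)   # only produced by the script
+    fn = len(gold_set - pred_set)   # only in the gold standard
+    precision = tp / (tp + fp) if tp + fp else 0.0
+    recall = tp / (tp + fn) if tp + fn else 0.0
+    if precision + recall:
+        f1 = 2 * precision * recall / (precision + recall)
+    else:
+        f1 = 0.0
+    return {"tp": tp, "fp": fp, "fn": fn,
+            "precision": precision, "recall": recall, "f1": f1}
+
+
+gold = [(0, 5, "Author"), (10, 15, "Author"), (20, 25, "Title")]
+pred = [(0, 5, "Author"), (10, 14, "Author"), (20, 25, "Title")]
+print(evaluate_exact(gold, pred))
+]]></programlisting>
+				In this example, the second annotation differs in its end
+				offset and is therefore counted once as a false positive and
+				once as a false negative.
+			</para>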
+			<section id="ugr.tools.tm.testing.overview.main">
+				<title>Main View</title>
+				<para>
+					The following picture shows a close-up of the main view of the
+					testing environment. The toolbar contains all buttons needed to
+					operate the plug-in. The first line shows the name of the script
+					that is going to be tested and a combo box for selecting the view
+					that should be tested. To the right follow fields that show some
+					basic information about the results of the test run. Below, on
+					the left, the test list is located. This list contains the
+					different test files. Right beside it, you will find a table with
+					statistical information. It shows the total tp, fp and fn counts
+					as well as precision, recall and F1 score of every test file and
+					of every type in each file.
+					<screenshot>
+						<mediaobject>
+							<imageobject>
+								<imagedata scale="80" format="PNG"
+									fileref="&imgroot;Screenshot_testing_desc_3_resize.png" />
+							</imageobject>
+							<textobject>
+								<phrase>The main view of the testing environment.</phrase>
+							</textobject>
+						</mediaobject>
+					</screenshot>
+				</para>
+			</section>
+			<section id="ugr.tools.tm.testing.overview.result">
+				<title>Result Views</title>
+				<para>
+					These views add additional information to the CAS View once a
+					result file is opened. Each view displays one of the following
+					annotation types in a hierarchical tree structure: true
+					positives, false positives and false negatives. Adding a check
+					mark to one of the annotations in a result view highlights the
+					annotation in the CAS Editor.
+					<screenshot>
+						<mediaobject>
+							<imageobject>
+								<imagedata scale="80" format="PNG"
+									fileref="&imgroot;Screenshot_result.png" />
+							</imageobject>
+							<textobject>
+								<phrase>The result views of the testing environment.</phrase>
+							</textobject>
+						</mediaobject>
+					</screenshot>
+				</para>
+			</section>
+			<section id="ugr.tools.tm.testing.overview.preferences">
+				<title>Preference Page</title>
+				<para>
+					The preference page offers a few options that modify the
+					plug-in's general behavior. For example, the preloading of
+					previously collected result data can be turned off, should it
+					take too long to load. An important option in the preference page
+					is the choice of the evaluator. By default, the "exact evaluator"
+					is selected, which compares the offsets of the annotations that
+					are contained in the file produced by the selected script with
+					the annotations in the test file. Other evaluators compare
+					annotations in a different way.
+					<screenshot>
+						<mediaobject>
+							<imageobject>
+								<imagedata scale="80" format="PNG"
+									fileref="&imgroot;Screenshot_preferences.png" />
+							</imageobject>
+							<textobject>
+								<phrase>The preference page of the testing environment.
+								</phrase>
+							</textobject>
+						</mediaobject>
+					</screenshot>
+				</para>
+			</section>
+			<section id="ugr.tools.tm.testing.overview.project">
+				<title>The TextMarker Project Structure</title>
+				<para>
+					The picture shows the TextMarker script explorer. Every
+					TextMarker project contains a folder called "test". This folder is
+					the default location for the test files. In this folder, each
+					script file has its own sub-folder with a relative path equal to
+					the script's package path in the "script" folder. This sub-folder
+					contains the test files. In every script's test folder, you will
+					also find a result folder with the results of the tests. Should
+					you use test files from another location in the file system, the
+					results will be saved in the "temp" sub-folder of the project's
+					"test" folder. All files in the "temp" folder are deleted once
+					Eclipse is closed.
+					<screenshot>
+						<mediaobject>
+							<imageobject>
+								<imagedata scale="80" format="PNG"
+									fileref="&imgroot;folder_struc_sep_desc_cut.png" />
+							</imageobject>
+							<textobject>
+								<phrase>Script Explorer with the test folder expanded.</phrase>
+							</textobject>
+						</mediaobject>
+					</screenshot>
+				</para>
+			</section>
+		</section>
+		<section id="ugr.tools.tm.testing.usage">
+			<title>Usage</title>
+			<para>
+				This section will demonstrate how to use the testing
+				environment.
+				It will show the basic actions needed to perform a test
+				run.
+			</para>
+			<para>
+				Preparing Eclipse:
+				The testing environment provides its own
+				perspective called
+				"TextMarker Testing". It will display the main
+				view as well as the
+				different result views on the right hand side.
+				It is encouraged to
+				use this perspective, especially when working
+				with the testing
+				environment for the first time.
+			</para>
+			<para>
+				Selecting a script for testing:
+				TextMarker always tests the script that is currently open in the
+				script editor. Should another editor be open, for example a Java
+				editor displaying a Java class, the testing view is not available.
+			</para>
+			<para>
+				Creating a test file:
+				A test file is a previously annotated .xmi file that can be
+				used as a gold standard for the test. The testing environment
+				does not provide additional tools for creating such a file,
+				because the TextMarker system already provides them.
+			</para>
+			<para>
+				Selecting a test file:
+				Test files can be added to the test-file list by dragging
+				them from the Script Explorer into the list. Depending on the
+				setting in the preference page, test files from a script's
+				"test" folder might already be loaded into the list. Another
+				way to add test files is the "Add files from folder" button,
+				which adds all .xmi files from a selected folder. The "del"
+				key removes files from the test-file list.
+			</para>
+			<para>
+				Selecting a CAS View to test:
+				TextMarker supports different views that allow you to operate
+				on different levels in a document. The InitialView is selected
+				by default. You can switch the evaluation to another view by
+				typing the view's name into the list or by selecting the view
+				you wish to use from the list.
+			</para>
+			<para>
+				Selecting the evaluator:
+				The testing environment supports different evaluators that
+				allow a sophisticated analysis of the behavior of a TextMarker
+				script. The evaluator can be chosen in the testing
+				environment's preference page. The preference page can be
+				opened either through the menu or by clicking the blue
+				preference button in the testing view's toolbar. The default
+				evaluator is the "Exact CAS Evaluator", which compares the
+				offsets of the annotations between the test file and the file
+				annotated by the tested script.
+			</para>
+			<para>
+				Excluding Types:
+				During a test run it might be convenient to disable testing
+				for specific types like punctuation or tags. The "exclude
+				types" button opens a dialog where all types can be selected
+				that should not be considered in the test.
+			</para>
+			<para>
+				Running the test:
+				A test-run can be started by clicking on the
+				green start button in
+				the toolbar.
+			</para>
+			<para>
+				Result Overview:
+				The testing main view displays information on how well the
+				script performed after every test run. It shows the overall
+				number of true positive, false positive and false negative
+				annotations across all result files as well as an overall
+				F1 score. Furthermore, a table is displayed that contains the
+				overall statistics of the selected test file as well as
+				statistics for every single type in the test file. The
+				information displayed comprises true positives, false
+				positives, false negatives, precision, recall and F1 measure.
+			</para>
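+			<para>
+				The measures listed above follow the usual definitions. The
+				following is a minimal sketch of how they can be derived from
+				the three counts. The function name and the handling of empty
+				denominators are assumptions made for illustration and are not
+				taken from the TextMarker sources.
+			</para>
+			<programlisting><![CDATA[
```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, f1) for the given annotation counts.

    Hypothetical helper for illustration; not part of the TextMarker API.
    """
    # Assumption: a score with an empty denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp > 0 else 0.0
    recall = tp / (tp + fn) if tp + fn > 0 else 0.0
    if precision + recall > 0:
        # F1 is the harmonic mean of precision and recall.
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Example: 8 true positives, 2 false positives, 8 false negatives.
scores = precision_recall_f1(tp=8, fp=2, fn=8)
print(scores)
```
+]]></programlisting>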
+			<para>
+				The testing environment also supports exporting the overall
+				data as a comma-separated table. Clicking "export evaluation
+				data" opens a dialog window that contains this table. The
+				text in this table can be copied and easily imported into
+				OpenOffice.org or MS Excel.
+			</para>
+			<para>
+				Result Files:
+				When running a test, the evaluator creates a new result .xmi
+				file and adds new true positive, false positive and false
+				negative annotations to it. By clicking on a file in the
+				test-file list, you can open the corresponding result .xmi
+				file in the TextMarker script editor. When a result file is
+				opened from the script explorer, additional views open that
+				allow easy access to and browsing of the additional debugging
+				annotations.
+				<screenshot>
+					<mediaobject>
+						<imageobject>
+							<imagedata scale="80" format="PNG"
+								fileref="&imgroot;Screenshot_Result_TP_desc_close_cut.png" />
+						</imageobject>
+						<textobject>
+							<phrase>Open result file and selected true positive annotation
+								in the true positive view.
+							</phrase>
+						</textobject>
+					</mediaobject>
+				</screenshot>
+			</para>
+		</section>
+		<section id="ugr.tools.tm.testing.evaluators">
+			<title>Evaluators</title>
+			<para>
+				When testing a CAS file, the system compares the offsets of
+				the annotations of a previously annotated gold standard file
+				with the offsets of the annotations of the result file the
+				script produced. Evaluators are responsible for comparing the
+				annotations in the two CAS files. These evaluators implement
+				different methods and strategies for comparing the
+				annotations. In addition, an extension point is provided that
+				allows easy implementation of new evaluators.
+			</para>
+			<para>
+				Exact Match Evaluator:
+				The Exact Match Evaluator compares the offsets of the
+				annotations in the result and the gold standard file. Any
+				difference is marked with either a false positive or a false
+				negative annotation.
+			</para>
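+			<para>
+				As a rough illustration of exact matching, the following
+				sketch treats each annotation as a (type, begin, end) tuple and
+				compares the two files as sets. The tuple representation and
+				the function name are assumptions made for this example. The
+				actual evaluator operates on CAS annotations.
+			</para>
+			<programlisting><![CDATA[
```python
def exact_match(gold, result):
    """Split annotations into true positives, false positives, false negatives.

    Annotations are (type, begin, end) tuples; this layout is an assumption
    for illustration, not the TextMarker data model.
    """
    gold_set, result_set = set(gold), set(result)
    # Identical type and offsets in both files: true positive.
    true_positives = gold_set.intersection(result_set)
    # Produced by the script but absent from the gold standard.
    false_positives = result_set.difference(gold_set)
    # Present in the gold standard but missed by the script.
    false_negatives = gold_set.difference(result_set)
    return true_positives, false_positives, false_negatives


gold = [("Author", 0, 20), ("Title", 22, 40)]
result = [("Author", 0, 20), ("Title", 22, 41)]
tp, fp, fn = exact_match(gold, result)
print(tp, fp, fn)
```
+]]></programlisting>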
+			<para>
+				Partial Match Evaluator:
+				The Partial Match Evaluator compares the offsets of the
+				annotations in the result and the gold standard file. It
+				tolerates differences at the beginning or the end of an
+				annotation. For example, "corresponding" and "corresponding "
+				will not be annotated as an error.
+			</para>
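+			<para>
+				One possible reading of partial matching is sketched below: two
+				annotations of the same type count as a match when their spans
+				overlap. This overlap criterion, the (type, begin, end) tuples
+				and the function names are assumptions for illustration. The
+				text above does not specify how much the boundaries may
+				differ.
+			</para>
+			<programlisting><![CDATA[
```python
def spans_overlap(a, b):
    """True if two (type, begin, end) annotations share a type and overlap."""
    # End offsets are treated as exclusive (assumption).
    return a[0] == b[0] and min(a[2], b[2]) > max(a[1], b[1])


def partial_match(gold, result):
    """Greedy matching of result annotations against the gold standard."""
    unmatched_gold = list(gold)
    true_positives, false_positives = [], []
    for annotation in result:
        partner = next(
            (g for g in unmatched_gold if spans_overlap(g, annotation)), None
        )
        if partner is None:
            false_positives.append(annotation)
        else:
            # Each gold annotation can be matched at most once.
            unmatched_gold.remove(partner)
            true_positives.append(annotation)
    # Gold annotations left without a partner are false negatives.
    return true_positives, false_positives, unmatched_gold


# "corresponding" (13 characters) versus "corresponding " (14 characters).
print(partial_match([("Word", 0, 13)], [("Word", 0, 14)]))
```
+]]></programlisting>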
+			<para>
+				Core Match Evaluator:
+				The Core Match Evaluator accepts annotations that share a
+				core expression. In this context, a core expression is at
+				least four characters long and starts with a capital letter.
+				For example, the two annotations "L404-123-421" and
+				"L404-321-412" would be considered a true positive match,
+				because "L404" is a core expression that is contained in
+				both annotations.
+			</para>
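+			<para>
+				The idea can be sketched as follows. The sketch assumes that a
+				core expression is a token of at least four characters that
+				begins with an uppercase letter, and that tokens are separated
+				by non-alphanumeric characters. Both the tokenization and the
+				function names are assumptions for illustration and may differ
+				from the actual evaluator.
+			</para>
+			<programlisting><![CDATA[
```python
import re


def core_expressions(text):
    """Return the set of candidate core expressions in the covered text."""
    # Assumption: tokens are maximal runs of ASCII letters and digits.
    tokens = re.split(r"[^A-Za-z0-9]+", text)
    return {t for t in tokens if len(t) >= 4 and t[0].isupper()}


def core_match(gold_text, result_text):
    """True if both covered texts share at least one core expression."""
    shared = core_expressions(gold_text).intersection(
        core_expressions(result_text)
    )
    return bool(shared)


print(core_match("L404-123-421", "L404-321-412"))
```
+]]></programlisting>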
+			<para>
+				Word Accuracy Evaluator:
+				Compares the labels of all words and numbers in an annotation,
+				where the label is the type of the annotation. As a
+				consequence, each word or number that is not part of the
+				annotation is counted as a single false negative. For example,
+				take the sentence "Christmas is on the 24.12 every year."
+				The script labels "Christmas is on the 12" as a single
+				sentence, while the test file labels the sentence correctly
+				with a single sentence annotation. While the Exact CAS
+				Evaluator will only assign a single false negative annotation,
+				the Word Accuracy Evaluator will mark every word or number as
+				a single false negative.
+			</para>
+			<para>
+				Template Only Evaluator:
+				This evaluator compares the offsets of the annotations and
+				the features that have been created by the script. For
+				example, the text "Alan Mathison Turing" is marked with the
+				author annotation, and "author" contains two features:
+				"FirstName" and "LastName". If the script now creates an
+				author annotation with only one feature, the annotation will
+				be marked as a false positive.
+			</para>
+			<para>
+				Template on Word Level Evaluator:
+				The Template on Word Level Evaluator compares the offsets of
+				the annotations. In addition, it compares the features and
+				feature structures and the values stored in the features. For
+				example, the annotation "author" might have features like
+				"FirstName" and "LastName". The author's name is "Alan
+				Mathison Turing" and the script correctly assigns the author
+				annotation. The features assigned by the script are
+				"FirstName: Alan" and "LastName: Mathison", while the correct
+				feature values would be "FirstName: Alan" and "LastName:
+				Turing". In this case the Template on Word Level Evaluator
+				will mark the annotation as a false positive, since the
+				feature values differ.
+			</para>
+		</section>
+
+	</section>
+	<section id="ugr.tools.tm.textruler">
+		<title>TextRuler</title>
+		<para>
+			Using the knowledge engineering approach, a knowledge engineer
+			normally writes handcrafted rules to create a domain-dependent
+			information extraction application, often supported by a gold
+			standard. When starting the engineering process for the
+			acquisition of the extraction knowledge for a possibly new slot
+			or, more generally, for new concepts, machine learning methods
+			are often able to offer support in an iterative engineering
+			process. This section gives a conceptual overview of the process
+			model for the semi-automatic development of rule-based
+			information extraction applications.
+		</para>
+		<para>
+			First, a suitable set of documents that contain the text
+			fragments with
+			interesting patterns needs to be selected and
+			annotated with the
+			target concepts. Then, the knowledge engineer
+			chooses and configures
+			the methods for automatic rule acquisition to
+			the best of his
+			knowledge for the learning task: Lambda expressions
+			based on tokens
+			and linguistic features, for example, differ in their
+			application
+			domain from wrappers that process generated HTML pages.
+		</para>
+		<para>
+			Furthermore, parameters like the window size defining relevant
+			features need to be set to an appropriate level. Before the
+			annotated training documents form the input of the learning
+			task, they are enriched with features generated by the partial
+			rule set of the developed application. The results of the
+			methods, that is, the learned rules, are proposed to the
+			knowledge engineer for the extraction of the target concept.
+		</para>
+		<para>
+			The knowledge engineer has different options to proceed: If the
+			quality, amount or generality of the presented rules is not
+			sufficient, then additional training documents need to be annotated
+			or additional rules have to be handcrafted to provide more features
+			in general or more appropriate features. Rules or rule sets of high
+			quality can be modified, combined or generalized and transferred to
+			the rule set of the application in order to support the extraction
+			task of the target concept. In the case that the methods did not
+			learn reasonable rules at all, the knowledge engineer proceeds with
+			writing handcrafted rules.
+		</para>
+		<para>
+			Having gathered enough extraction knowledge for the current
+			concept, the
+			semi-automatic process is iterated and the focus is
+			moved to the
+			next concept until the development of the application is
+			completed.
+		</para>
+		<section id="ugr.tools.tm.textruler.learner">
+			<title>Available Learners</title>
+			<para>
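+				Overview
+				<informaltable frame="all">
+					<tgroup cols="5">
+						<thead>
+							<row><entry>Name</entry><entry>Strategy</entry><entry>Document</entry><entry>Slots</entry><entry>Status</entry></row>
+						</thead>
+						<tbody>
+							<row><entry>BWI (1)</entry><entry>Boosting, Top Down</entry><entry>Struct, Semi</entry><entry>Single, Boundary</entry><entry>Planning</entry></row>
+							<row><entry>LP2 (2)</entry><entry>Bottom Up Cover</entry><entry>All</entry><entry>Single, Boundary</entry><entry>Prototype</entry></row>
+							<row><entry>RAPIER (3)</entry><entry>Top Down/Bottom Up Compr.</entry><entry>Semi</entry><entry>Single</entry><entry>Experimental</entry></row>
+							<row><entry>WHISK (4)</entry><entry>Top Down Cover</entry><entry>All</entry><entry>Multi</entry><entry>Prototype</entry></row>
+							<row><entry>WIEN (5)</entry><entry>CSP</entry><entry>Struct</entry><entry>Multi, Rows</entry><entry>Prototype</entry></row>
+						</tbody>
+					</tgroup>
+				</informaltable>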
+			</para>
+			<para>
+				<itemizedlist>
+					<listitem>
+						<para>Strategy: The learning methods commonly use covering
+							algorithms as their strategy.</para>
+					</listitem>
+					<listitem>
+						<para>Document: The type of the document may be "free" like
+							in newspapers, "semi" or "struct" like HTML pages.</para>
+					</listitem>
+					<listitem>
+						<para>Slots: The slots refer to a single annotation that
+							represents the goal of the learning task. Some rules are
+							able to create several annotations at once in the same
+							context (multi-slot). However, only single slots are
+							supported by the current implementations.</para>
+					</listitem>
+					<listitem>
+						<para>Status: The current status of the implementation in
+							the TextRuler framework.</para>
+					</listitem>
+				</itemizedlist>
+			</para>
+			<para>
+				Publications
+			</para>
+			<para>
+				(1) Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper
+				Induction.
+				In AAAI/IAAI, pages 577-583, 2000.
+			</para>
+			<para>
+				(2) F. Ciravegna. (LP)2, Rule Induction for Information
+				Extraction
+				Using Linguistic Constraints. Technical Report CS-03-07,
+				Department
+				of Computer Science, University of Sheffield, Sheffield,
+				2003.
+			</para>
+			<para>
+				(3) Mary Elaine Califf and Raymond J. Mooney. Bottom-up
+				Relational
+				Learning of Pattern Matching Rules for Information
+				Extraction.
+				Journal of Machine Learning Research, 4:177-210, 2003.
+			</para>
+			<para>
+				(4) Stephen Soderland, Claire Cardie, and Raymond Mooney.
+				Learning
+				Information Extraction Rules for Semi-Structured and Free
+				Text. In
+				Machine Learning, volume 34, pages 233-272, 1999.
+			</para>
+			<para>
+				(5) N. Kushmerick, D. Weld, and B. Doorenbos. Wrapper
+				Induction for
+				Information Extraction. In Proc. IJC Artificial
+				Intelligence, 1997.
+			</para>
+			<para>
+				BWI
+				BWI (Boosted Wrapper Induction) uses boosting techniques to
+				improve
+				the performance of simple pattern matching single-slot
+				boundary
+				wrappers (boundary detectors). Two sets of detectors are
+				learned:
+				the "fore" and the "aft" detectors. Weighted by their
+				confidences
+				and combined with a slot length histogram derived from
+				the training
+				data they can classify a given pair of boundaries
+				within a
+				document. BWI can be used for structured, semi-structured
+				and free
+				text. The patterns are token-based with special wildcards
+				for more
+				general rules.
+			</para>
+			<para>
+				Implementations
+				No implementations are yet available.
+			</para>
+			<para>
+				Parameters
+				No parameters are yet available.
+
+			</para>
+			<para>
+				LP2
+				This method operates on all three kinds of documents. It
+				learns
+				separate rules for the beginning and the end of a single
+				slot. So-called
+				tagging rules insert boundary SGML tags and
+				additionally
+				induced correction rules shift misplaced tags to their
+				correct
+				positions in order to improve precision. The learning
+				strategy is a
+				bottom-up covering algorithm. It starts by creating a
+				specific seed
+				instance with a window of w tokens to the left and
+				right of the
+				target boundary and searches for the best
+				generalization. Other
+				linguistic NLP-features can be used in order
+				to generalize over the
+				flat word sequence.
+			</para>
+			<para>
+				Implementations
+				LP2 (naive):
+				LP2 (optimized):
+			</para>
+			<para>
+				Parameters
+				Context Window Size (to the left and right):
+				Best
+				Rules List Size:
+				Minimum Covered Positives per Rule:
+				Maximum Error
+				Threshold:
+				Contextual Rules List Size:
+			</para>
+			<para>
+				RAPIER
+				RAPIER induces single-slot extraction rules for
+				semi-structured documents. The rules consist of three
+				patterns: a pre-filler, a filler and a post-filler pattern.
+				Each can hold several constraints on tokens and their
+				corresponding POS tag and semantic information. The algorithm
+				uses a bottom-up compression strategy, starting with a most
+				specific seed rule for each training instance. This initial
+				rule base is compressed by randomly selecting rule pairs and
+				searching for the best generalization. Considering two rules,
+				the least general generalization (LGG) of the slot fillers is
+				created and specialized by adding rule items to the pre- and
+				post-filler until the new rules operate well on the training
+				set. The best of the k rules (k-beam search) is added to the
+				rule base and all empirically subsumed rules are removed.
+			</para>
+			<para>
+				Implementations
+				RAPIER:
+			</para>
+			<para>
+				Parameters
+				Maximum Compression Fail Count:
+				Internal Rules List
+				Size:
+				Rule Pairs for Generalizing:
+				Maximum 'No improvement' Count:
+				Maximum Noise Threshold:
+				Minimum Covered Positives Per Rule:
+				PosTag
+				Root Type:
+				Use All 3 GenSets at Specialization:
+			</para>
+			<para>
+				WHISK
+				WHISK is a multi-slot method that operates on all three
+				kinds of
+				documents and learns single- or multi-slot rules looking
+				similar to
+				regular expressions. The top-down covering algorithm
+				begins with
+				the most general rule and specializes it by adding
+				single
+				rule terms until the rule makes no errors on the training
+				set. Domain
+				specific classes or linguistic information obtained by a
+				syntactic
+				analyzer can be used as additional features. The exact
+				definition
+				of a rule term (e.g. a token) and of a problem instance
+				(e.g. a
+				whole document or a single sentence) depends on the
+				operating
+				domain and document
+				type.
+			</para>
+			<para>
+				Implementations
+				WHISK (token):
+				WHISK (generic):
+			</para>
+			<para>
+				Parameters
+				Window Size:
+				Maximum Error Threshold:
+				PosTag Root
+				Type:
+			</para>
+			<para>
+				WIEN
+				WIEN is the only method listed here that operates on highly
+				structured texts only. It induces so-called wrappers that
+				anchor the slots by the structured context around them. The
+				HLRT (head left right tail) wrapper class, for example, can
+				determine and extract several multi-slot templates by first
+				separating the important information block from unimportant
+				head and tail portions and then extracting multiple data rows
+				from table-like data structures in the remaining document.
+				Inducing a wrapper is done by solving a CSP for all possible
+				pattern combinations from the training data.
+			</para>
+			<para>
+				Implementations
+				WIEN:
+			</para>
+			<para>
+				Parameters
+				No parameters are available.
+
+			</para>
+		</section>
+	</section>
+
+</chapter>
\ No newline at end of file