Posted to commits@uima.apache.org by pk...@apache.org on 2012/07/20 14:27:15 UTC
svn commit: r1363750 [3/3] - in
/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker: ./ src/docbook/
Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml?rev=1363750&view=auto
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml (added)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.workbench.xml Fri Jul 20 12:27:14 2012
@@ -0,0 +1,1483 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+<!ENTITY imgroot "images/tools/tools.textmarker/" >
+<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >
+%uimaents;
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor
+ license agreements. See the NOTICE file distributed with this work for additional
+ information regarding copyright ownership. The ASF licenses this file to
+ you under the Apache License, Version 2.0 (the "License"); you may not use
+ this file except in compliance with the License. You may obtain a copy of
+ the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required
+ by applicable law or agreed to in writing, software distributed under the
+ License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS
+ OF ANY KIND, either express or implied. See the License for the specific
+ language governing permissions and limitations under the License. -->
+
+<chapter id="ugr.tools.tm.workbench">
+ <title>TextMarker Workbench</title>
+ <para>
+ </para>
+
+ <section id="ugr.tools.tm.install">
+ <title>Installation</title>
+ <para>
+ To install the TextMarker workbench, follow these steps:
+ <orderedlist>
+ <listitem>
+ <para>Download, install and start Eclipse 3.5 or Eclipse 3.6.</para>
+ </listitem>
+ <listitem>
+ <para>Add the Apache UIMA update site
+ (http://www.apache.org/dist/uima/eclipse-update-site/) and the
+ TextMarker update site
+ (http://ki.informatik.uni-wuerzburg.de/~pkluegl/updatesite/) to the
+ available software sites of your Eclipse installation. This can be
+ achieved in the "Install New Software" dialog in the help menu of
+ Eclipse.</para>
+ </listitem>
+ <listitem>
+ <para>Eclipse 3.6 only: TextMarker is currently based on DLTK 1.0.
+ Therefore, the DLTK 1.0 update site
+ (http://download.eclipse.org/technology/dltk/updates-dev/1.0/) must
+ be added as well, since the Eclipse 3.6 update site only supports
+ DLTK 2.0.</para>
+ </listitem>
+ <listitem>
+ <para>Select "Install New Software" in the help menu of Eclipse, if
+ not done yet.</para>
+ </listitem>
+ <listitem>
+ <para>Select the TextMarker update site at "Work with", deselect
+ "Group items by category" and select "Contact all update sites
+ during install to find required software".</para>
+ </listitem>
+ <listitem>
+ <para>Select the TextMarker feature and continue the dialog. The CEV
+ feature is already contained in the TextMarker feature. Eclipse will
+ automatically install the Apache UIMA (version 2.3) plugins and the
+ DLTK Core Framework (version 1.X) plugins.</para>
+ </listitem>
+ <listitem>
+ <para>(Optional) If additional HTML visualizations are desired, then
+ also install the CEV HTML feature. However, you need to install the
+ XPCom and XULRunner features beforehand, for example by using an
+ appropriate update site
+ (http://ftp.mozilla.org/pub/mozilla.org/xulrunner/eclipse/). Please
+ refer to the CEV installation instructions for details.</para>
+ </listitem>
+ <listitem>
+ <para>After the successful installation, switch to the TextMarker
+ perspective.</para>
+ </listitem>
+ </orderedlist>
+ You can also download the TextMarker plugins from SourceForge.net
+ (https://sourceforge.net/projects/textmarker/) and install the
+ plugins mentioned above manually.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.project">
+ <title>TextMarker Projects</title>
+ <para>
+ Similar to Java projects in Eclipse, the TextMarker workbench
+ provides the possibility to create TextMarker projects. TextMarker
+ projects require a certain folder structure that is created with the
+ project. The most important folders are the script folder that
+ contains the TextMarker rule files in a package and the descriptor
+ folder that contains the generated UIMA components. The input folder
+ contains the text files or xmiCAS files that are processed when a
+ TextMarker script is launched. The results are placed in the
+ output folder.
+
+ <programlisting><![CDATA[
+ ||Project element|| Used for
+ | Project | the TextMarker project
+ | - script | source folder with TextMarker scripts
+ | -- my.package | the package, resulting in several folders
+ | --- Script.tm | a TextMarker script
+ | - descriptor | build folder for UIMA components
+ | -- my/package | the folder structure for the components
+ | --- ScriptEngine.xml | the analysis engine of the Script.tm script
+ | --- ScriptTypeSystem.xml | the type system of the Script.tm script
+ | -- BasicEngine.xml | the analysis engine template for all generated engines in this project
+ | -- BasicTypeSystem.xml | the type system template for all generated type systems in this project
+ | -- InternalTypeSystem.xml | a type system with TextMarker types
+ | -- Modifier.xml | the analysis engine of the optional modifier that creates the ''modified'' view
+ | - input | folder that contains the files that will be processed when launching a TextMarker script
+ | -- test.html | an input file containing html
+ | -- test.xmi | an input file containing text and annotations
+ | - output | folder that contains the files that were processed by a TextMarker script
+ | -- test.html.modified.html | the result of the modifier: replaced text and colored html
+ | -- test.html.xmi | the result CAS with optional information
+ | -- test.xmi.modified.html | the result of the modifier: replaced text and colored html
+ | -- test.xmi.xmi | the result CAS with optional information
+ | - resources | default folder for word lists and dictionaries
+ | -- Dictionary.mtwl | a dictionary in the "multi tree word list" format
+ | -- FirstNames.txt | a simple word list with first names: one first name per line
+ | - test | test-driven development is still under construction
+]]></programlisting>
+
+ </para>
+
+ </section>
+ <section id="ugr.tools.tm.explain">
+ <title>Explanation</title>
+ <para>
+ Handcrafting rules is laborious, especially if newly written rules
+ do not behave as expected. The TextMarker system is able to log the
+ application of every single rule and block in order to provide an
+ explanation of the rule inference and a minimal debugging
+ functionality.
+
+ The explanation component is built upon the CEV plugin. The
+ information about the application of the rules is stored in the
+ resulting xmiCAS if the parameters of the executed engine are
+ configured correctly. The simplest way to generate this
+ information is to open a TextMarker file and click on the common
+ "Debug" button (which looks like a green bug) in Eclipse. The current
+ TextMarker file is then executed on the text files in the input
+ directory, and xmiCAS files are created in the output directory that
+ contain the additional UIMA feature structures describing the rule
+ inference. The resulting xmiCAS needs to be opened with the CEV
+ plugin. However, only additional views are capable of displaying the
+ debug information. In order to open the necessary views, you can
+ either open the "Explain" perspective or open the views separately
+ and arrange them as you like.
+
+ There are currently seven views that display information about the
+ execution of the rules: Applied Rules, Selected Rules, Rule List,
+ Matched Rules, Failed Rules, Rule Elements and Basic Stream.
+
+ </para>
+
+ </section>
+ <section id="ugr.tools.tm.dictionaries">
+ <title>Dictionaries</title>
+ <para>
+
+ The TextMarker system currently supports the usage of dictionaries in
+ four different ways. The files are always encoded in UTF-8. The
+ generated analysis engines provide a parameter "resourceLocation"
+ that specifies the folder that contains the external dictionary
+ files. The parameter is initially set to the resource folder of the
+ current TextMarker project. In order to use a different folder,
+ change the value of the parameter and rebuild all
+ TextMarker rule files in the project in order to update all analysis
+ engines.
+
+ The algorithm for the detection of the entries of a
+ dictionary:
+
+ <programlisting><![CDATA[
+for all basic annotations of the matched annotation do
+ set current candidate to current basic
+ loop
+ if the dictionary contains current candidate then
+ remember candidate
+ else if an entry of the dictionary starts with the current candidate then
+ add next basic annotation to the current candidate
+ continue loop
+ else
+ stop loop
+]]></programlisting>
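+
+ The detection loop above can be sketched in Python. This is only an
+ illustration and not the TextMarker implementation: the function name
+ find_entries, the representation of basic annotations as a list of
+ strings (including whitespace tokens) and the decision to stop at the
+ first hit for each start position are assumptions made for this sketch.

```python
def find_entries(basics, dictionary):
    """Return (start, end) index pairs of basic-token spans found in dictionary."""
    entries = set(dictionary)
    matches = []
    for start in range(len(basics)):
        candidate = basics[start]
        end = start + 1
        while True:
            if candidate in entries:
                matches.append((start, end))  # remember candidate
                break  # assumption: stop at the first hit for this start
            is_prefix = any(e.startswith(candidate) for e in entries)
            if is_prefix and end != len(basics):
                candidate += basics[end]  # add next basic annotation
                end += 1
                continue
            break  # no entry starts with the candidate: stop loop
    return matches


# Hypothetical input: tokens of "Peter met Hans Peter."
basics = ["Peter", " ", "met", " ", "Hans", " ", "Peter", "."]
names = ["Peter", "Hans Peter"]
print(find_entries(basics, names))  # [(0, 1), (4, 7), (6, 7)]
```

+ The prefix test scans all entries for every candidate, which matches
+ the cost of a plain word list; the tree word lists described below
+ avoid this scan.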
+
+
+
+
+ Word List (.txt)
+ Word lists are simple text files that contain one term or string
+ per line. The strings may include white spaces and are separated
+ by a line break.
+
+ Usage:
+ Content of a file named FirstNames.txt
+ (located in the resource folder of a
+ TextMarker project):
+ <programlisting><![CDATA[
+Peter
+Jochen
+Joachim
+Martin
+]]></programlisting>
+
+ Exemplary rules:
+ <programlisting><![CDATA[
+LIST FirstNameList = 'FirstNames.txt';
+DECLARE FirstName;
+Document{-> MARKFAST(FirstName, FirstNameList)};
+]]></programlisting>
+
+ In this example, all first names in the given text file are
+ annotated in the input document with the type FirstName.
+
+ Tree Word List (.twl)
+ A tree word list is a compiled word list similar to a trie. A .twl
+ file is an XML file that contains a tree-like structure with a node
+ for each character. The nodes themselves refer to child nodes that
+ represent all characters that succeed the character of the parent
+ node. For single-word entries, this results in a complexity of
+ O(m*log(n)) instead of a complexity of O(m*n) for a simple .txt
+ file, where m is the number of basic annotations in the document
+ and n is the number of entries in the dictionary.
+
+ Usage:
+ A .twl file is generated using the popup menu. Select one or more
+ .txt files (or a folder containing .txt files), click the right
+ mouse button and choose "Convert to TWL". One or more .twl
+ files with corresponding file names are then generated.
+
+ Exemplary rules:
+
+ <programlisting><![CDATA[
+LIST FirstNameList = 'FirstNames.twl';
+DECLARE FirstName;
+Document{-> MARKFAST(FirstName, FirstNameList)};
+]]></programlisting>
+
+ In this example, all first names in the given text file are again
+ annotated in the input document with the type FirstName.
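+
+ To illustrate why a tree structure with one node per character avoids
+ scanning every entry, here is a minimal in-memory trie in Python. It
+ is a sketch only: the actual .twl format is an XML file, and the names
+ build_trie, contains and END are hypothetical helpers chosen for this
+ example.

```python
END = ""  # marker key: an empty string can never be a single character


def build_trie(words):
    """Build a nested-dict trie with one node per character."""
    root = {}
    for word in words:
        node = root
        for char in word:
            node = node.setdefault(char, {})
        node[END] = True  # a complete entry ends at this node
    return root


def contains(trie, text):
    """Follow one child node per character instead of scanning all entries."""
    node = trie
    for char in text:
        if char not in node:
            return False
        node = node[char]
    return END in node


trie = build_trie(["Peter", "Jochen", "Joachim", "Martin"])
print(contains(trie, "Jochen"))  # True
print(contains(trie, "Jo"))      # False: only a prefix of two entries
```

+ The lookup cost depends on the length of the candidate rather than
+ on the number of entries in the list.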
+
+ Multi Tree Word List (.mtwl)
+ A multi tree word list is generated from multiple .txt files and
+ contains special nodes that provide additional information about
+ the original file of an entry. The .mtwl files are useful if several
+ different dictionaries are used in a TextMarker file. For five
+ dictionaries, for example, five MARKFAST rules are necessary.
+ The matched text is therefore searched five times and the complexity
+ is 5 * O(m*log(n)). Using a .mtwl file reduces the complexity to
+ about O(m*log(5*n)).
+
+ Usage:
+ A .mtwl file is generated using the popup menu. Select one or more
+ .txt files (or a folder containing .txt files), click the right
+ mouse button and choose "Convert to MTWL". A .mtwl file named
+ "generated.mtwl" is then generated that contains the word lists of
+ all selected .txt files. Renaming the .mtwl file is recommended.
+
+
+ If, for example, the word lists "FirstNames.txt", "Companies.txt"
+ and so on are given and the generated .mtwl file is renamed to
+ "Dictionary.mtwl", then the following rule annotates all companies
+ and first names in the complete document.
+
+ Exemplary rules:
+
+ <programlisting><![CDATA[
+LIST Dictionary = 'Dictionary.mtwl';
+DECLARE FirstName, Company;
+Document{-> TRIE("FirstNames.txt" = FirstName, "Companies.txt" = Company, Dictionary, false, 0, false, 0, "")};
+]]></programlisting>
+
+
+
+
+ Table (.csv)
+ The TextMarker system also supports tables in the form of .csv
+ files.
+
+ Usage:
+ Content of a file named TestTable.csv
+ (located in the resource folder of a
+ TextMarker project):
+ <programlisting><![CDATA[
+Peter;P;
+Jochen;J;
+Joba;J;
+]]></programlisting>
+
+ Exemplary rules:
+ <programlisting><![CDATA[
+PACKAGE de.uniwue.tm;
+TABLE TestTable = 'TestTable.csv';
+DECLARE Annotation Struct (STRING first);
+Document{-> MARKTABLE(Struct, 1, TestTable, "first" = 2)};
+]]></programlisting>
+ In this example, the document is searched for all occurrences of the
+ entries of the first column of the given table. For each occurrence,
+ an annotation of the type Struct is created and its feature "first"
+ is filled with the entry of the second column.
+
+ For the input document with the content "Peter", the result is a
+ single annotation of the type Struct with P assigned to its feature
+ "first".
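+
+ The effect of the MARKTABLE rule above can be imitated in Python as
+ follows. This is a rough sketch under stated assumptions, not the
+ TextMarker implementation: the function mark_table and the dictionary
+ representation of annotations are hypothetical, and the sketch uses a
+ plain substring search instead of matching on basic annotations.

```python
import csv
import io


def mark_table(text, table_csv, index_column, feature_columns):
    """Create one annotation per occurrence of an index-column entry."""
    rows = csv.reader(io.StringIO(table_csv), delimiter=";")
    annotations = []
    for row in rows:
        if not row or not row[index_column - 1]:
            continue  # skip empty lines and empty keys
        key = row[index_column - 1]  # columns are counted from 1
        begin = text.find(key)
        while begin != -1:
            features = {name: row[col - 1] for name, col in feature_columns.items()}
            annotations.append({"begin": begin, "end": begin + len(key), **features})
            begin = text.find(key, begin + 1)
    return annotations


table = "Peter;P;\nJochen;J;\nJoba;J;\n"
print(mark_table("Peter", table, 1, {"first": 2}))
# [{'begin': 0, 'end': 5, 'first': 'P'}]
```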
+
+ </para>
+
+ </section>
+ <section id="ugr.tools.tm.parameters">
+ <title>Parameters</title>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>mainScript (String): This is the TextMarker script that
+ will be loaded and executed by the generated engine. The string
+ references the name of the file without file extension but
+ with its complete namespace, e.g., my.package.Main.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>scriptPaths (Multiple Strings): The given strings
+ specify the
+ folders that contain TextMarker script files, the
+ main script file
+ and the additional script files in particular.
+ Currently, there is
+ only one folder supported in the TextMarker
+ workbench (script).
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>enginePaths (Multiple Strings): The given strings
+ specify the
+ folders that contain additional analysis engines that
+ are called
+ from within a script file. Currently, there is only
+ one folder
+ supported in the TextMarker workbench (descriptor).
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>resourcePaths (Multiple Strings): The given strings
+ specify
+ the folders that contain the word lists and dictionaries.
+ Currently, there is only one folder supported in the TextMarker
+ workbench (resources).
+
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>additionalScripts (Multiple Strings): This parameter
+ contains a list of all known script files, referenced with their
+ complete namespace, e.g., my.package.AnotherOne.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>additionalEngines (Multiple Strings): This parameter
+ contains a list of all known analysis engines.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>additionalEngineLoaders (Multiple Strings): This
+ parameter
+ contains the class names of the implementations that
+ help to load
+ more complex analysis engines.
+
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>scriptEncoding (String): The encoding of the script
+ files.
+ Not yet supported, please use UTF-8.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>defaultFilteredTypes (Multiple Strings): The complete
+ names
+ of the types that are filtered by default.
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>defaultFilteredMarkups (Multiple Strings): The names of
+ the
+ markups that are filtered by default.
+
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>seeders (Multiple Strings):
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>useBasics (String):
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>removeBasics (Boolean):
+
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>debug (Boolean):
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>profile (Boolean):
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>debugWithMatches (Boolean):
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>statistics (Boolean):
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>debugOnlyFor (Multiple Strings):
+
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>style (Boolean):
+ </para>
+ </listitem>
+
+ <listitem>
+ <para>styleMapLocation (String):
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+
+ </section>
+ <section id="ugr.tools.tm.query">
+ <title>Query</title>
+ <para>
+ The query view can be used to write queries on several documents
+ within a folder using the TextMarker language.
+
+ A short example of how to use the Query view:
+ <itemizedlist>
+ <listitem>
+ <para> In the first field "Query Data", the folder in which the
+ query is executed is added, for example with drag and drop from
+ the script explorer. If the checkbox is activated, then all
+ subfolders are included in the query.
+ </para>
+ </listitem>
+ <listitem>
+ <para> The next field "Type System" must contain a type system
+ or a TextMarker script that specifies all types that are used in
+ the query.
+ </para>
+ </listitem>
+ <listitem>
+ <para> The query in the form of one or more TextMarker rules is
+ specified in the text field in the middle of the view. In the
+ example of the screenshot, all "Author" annotations are
+ selected that contain a "FalsePositive" or "FalseNegative"
+ annotation.
+ </para>
+ </listitem>
+ <listitem>
+ <para> If the start button near the tab of the view in the upper
+ right corner is pressed, then the results are displayed.
+ </para>
+ </listitem>
+ </itemizedlist>
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="80" format="PNG" fileref="&imgroot;Query.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Query View</phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+
+ </para>
+ </section>
+ <section id="ugr.tools.tm.views">
+ <title>Views</title>
+ <para>
+
+ </para>
+ <section id="ugr.tools.tm.views.browser">
+ <title>Annotation Browser</title>
+ <para>
+ </para>
+ </section>
+ <section id="ugr.tools.tm.views.editor">
+ <title>Annotation Editor</title>
+ <para>
+ </para>
+ </section>
+ <section id="ugr.tools.tm.views.palette">
+ <title>Marker Palette</title>
+ <para>
+ </para>
+ </section>
+ <section id="ugr.tools.tm.views.selection">
+ <title>Selection</title>
+ <para>
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.stream">
+ <title>Basic Stream</title>
+ <para>
+ The basic stream contains a listing of the complete disjoint
+ partition of the document by the TextMarkerBasic annotations that
+ are used for the inference and the annotation seeding.
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.applied">
+ <title>Applied Rules</title>
+ <para>
+ The Applied Rules view displays how often a rule tried to
+ apply and how often the rule succeeded. Additionally, some profiling
+ information is added after a short verbalization of the rule. The
+ information is structured: if BLOCK constructs were used in the
+ executed TextMarker file, the rules contained in that block are
+ represented as child nodes in the tree of the view. Each TextMarker
+ file is itself a BLOCK construct named after the file. Therefore,
+ the root node of the view is always a BLOCK containing the rules of
+ the executed TextMarker script. Additionally, if a rule calls a
+ different TextMarker file, then the root block of that file is the
+ child of that rule. The selection of a rule in this view
+ directly changes the information visualized in the other views.
+
+ </para>
+ </section>
+ <section id="ugr.tools.tm.views.selected">
+ <title>Selected Rules</title>
+ <para>
+ This view is very similar to the Applied Rules view, but
+ displays only rules and blocks under a given selection. If the user
+ clicks on the document, then an Applied Rules view is generated
+ that contains only the elements affecting that position in the
+ document. The Rule Elements view then only contains match
+ information of that position, but the result of the rule element
+ match is still displayed.
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.rulelist">
+ <title>Rule List</title>
+ <para>
+ This view is very similar to the Applied Rules view and the
+ Selected Rules view, but displays only rules and no blocks under
+ a given selection. If the user clicks on the document, then a list
+ of rules is generated that matched or tried to match at that
+ position in the document. The Rule Elements view then only contains
+ match information of that position, but the result of the rule
+ element match is still displayed. Additionally, this view provides a
+ text field for filtering the rules. Only those rules remain that
+ contain the entered text in their verbalization.
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.matched">
+ <title>Matched Rules</title>
+ <para>
+ If a rule is selected in the Applied Rules view, then this
+ view displays the instances (text passages) where this rule
+ matched.
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.failed">
+ <title>Failed Rules</title>
+ <para>
+ If a rule is selected in the Applied Rules view, then this
+ view displays the instances (text passages) where this rule failed
+ to match.
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.elements">
+ <title>Rule Elements</title>
+ <para>
+ If a successful or failed rule match in the Matched Rules view
+ or Failed Rules view is selected, then this view contains a listing
+ of the rule elements and their conditions. There is detailed
+ information available on what text each rule element matched and
+ which conditions evaluated to true.
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.statistics">
+ <title>Statistics</title>
+ <para>
+ This view displays the used conditions and actions of the
+ TextMarker language. Three numbers are given for each element: the
+ total time of execution, the number of executions and the time per
+ execution.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.views.fp">
+ <title>False Positive</title>
+ <para>
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.fn">
+ <title>False Negative</title>
+ <para>
+ </para>
+ </section>
+
+ <section id="ugr.tools.tm.views.tp">
+ <title>True Positive</title>
+ <para>
+
+ </para>
+ </section>
+ </section>
+ <section id="ugr.tools.tm.testing">
+ <title>Testing</title>
+ <para>
+ The TextMarker software comes bundled with its own testing
+ environment that allows you to test and evaluate TextMarker scripts.
+ It provides full back end testing capabilities and allows you to
+ examine test results in detail. As a product of the testing
+ operation, a new document file is created, and detailed information
+ on how well the script performed in the test is added to this
+ document.
+ </para>
+ <section id="ugr.tools.tm.testing.overview">
+ <title>Overview</title>
+ <para>
+ The testing procedure compares a previously annotated gold standard
+ file with the result of the selected TextMarker script using an
+ evaluator. The evaluators compare the offsets of annotations in
+ both documents and, depending on the evaluator, mark a result
+ document with true positive, false positive or false negative
+ annotations. Afterwards, the F1 score is calculated for the whole
+ set of tests, for each test file and for each type in the test file.
+ The testing environment contains the following parts:
+ <itemizedlist>
+ <listitem>
+ <para>Main view</para>
+ </listitem>
+ <listitem>
+ <para>Result views: true positive, false positive and false
+ negative view
+ </para>
+ </listitem>
+ <listitem>
+ <para>Preference page</para>
+ </listitem>
+ </itemizedlist>
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="80" format="PNG"
+ fileref="&imgroot;Screenshot_main.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Eclipse with open TextMarker and testing environment.
+ </phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+ All control elements that are needed for the interaction with the
+ testing environment are located in the main view. This is also
+ where test files can be selected and where information on how well
+ the script performed is displayed. During the testing process, a
+ result CAS file is produced that contains new annotation types like
+ true positives (tp), false positives (fp) and false negatives (fn).
+ While the result .xmi file is displayed in the script editor,
+ additional views allow easy navigation through the new annotations.
+ Additional tree views, like the true positive view, display the
+ corresponding annotations in a hierarchical structure. This allows
+ easy tracing of the results inside the testing document. A
+ preference page allows customization of the behavior of the testing
+ plug-in.
+ </para>
+ <section id="ugr.tools.tm.testing.overview.main">
+ <title>Main View</title>
+ <para>
+ The following picture shows a close-up of the main view of the
+ testing environment. The toolbar contains all buttons
+ needed to operate the plug-in. The first line shows the name of
+ the script that is going to be tested and a combo box in which the
+ view that should be tested is selected. To the right follow
+ fields that show some basic information about the results of the
+ test run.
+ Below and on the left, the test list is located. This list
+ contains the different test files. Right beside it, you will find
+ a table with statistical information. It shows the total tp, fp and
+ fn counts as well as precision, recall and F1 score of every
+ test file and for every type in each file.
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="80" format="PNG"
+ fileref="&imgroot;Screenshot_testing_desc_3_resize.png" />
+ </imageobject>
+ <textobject>
+ <phrase>The main view of the testing environment.</phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+ </section>
+ <section id="ugr.tools.tm.testing.overview.result">
+ <title>Result Views</title>
+ <para>
+ These views add additional information to the CAS view once a
+ result file is opened. Each view displays one of the following
+ annotation types in a hierarchical tree structure: true positives,
+ false positives and false negatives. Adding a check mark to one of
+ the annotations in a result view highlights the annotation in
+ the CAS editor.
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="80" format="PNG"
+ fileref="&imgroot;Screenshot_result.png" />
+ </imageobject>
+ <textobject>
+ <phrase>The result views of the testing environment.</phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+ </section>
+ <section id="ugr.tools.tm.testing.overview.preferences">
+ <title>Preference Page</title>
+ <para>
+ The preference page offers a few options that modify the
+ general behavior of the plug-in. For example, the preloading of
+ previously collected result data can be turned off should it
+ cause too long a loading time. An important option in the
+ preference page is the evaluator to use. By default, the
+ "exact evaluator" is selected, which compares the offsets of the
+ annotations contained in the file produced by the
+ selected script with the annotations in the test file. Other
+ evaluators compare annotations in a different way.
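+
+ As an illustration of an exact offset comparison, the following
+ Python sketch classifies annotations as true positives, false
+ positives and false negatives. It is not the evaluator shipped with
+ the workbench: the function exact_evaluate and the representation of
+ an annotation as a (type, begin, end) tuple are assumptions, as is
+ the choice to compare the type name in addition to the offsets.

```python
def exact_evaluate(gold, predicted):
    """Classify annotations by exact (type, begin, end) equality."""
    gold_set = set(gold)
    predicted_set = set(predicted)
    true_positives = sorted(gold_set.intersection(predicted_set))
    false_positives = sorted(predicted_set - gold_set)
    false_negatives = sorted(gold_set - predicted_set)
    return true_positives, false_positives, false_negatives


# Hypothetical gold standard and script result for one document.
gold = [("Author", 0, 12), ("Title", 14, 40)]
predicted = [("Author", 0, 12), ("Title", 14, 38)]
tp, fp, fn = exact_evaluate(gold, predicted)
print(tp)  # [('Author', 0, 12)]
print(fp)  # [('Title', 14, 38)]
print(fn)  # [('Title', 14, 40)]
```

+ With an exact comparison, an annotation whose end offset differs by
+ only two characters counts as both a false positive and a false
+ negative.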
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="80" format="PNG"
+ fileref="&imgroot;Screenshot_preferences.png" />
+ </imageobject>
+ <textobject>
+ <phrase>The preference page of the testing environment.
+ </phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+ </section>
+ <section id="ugr.tools.tm.testing.overview.project">
+ <title>The TextMarker Project Structure</title>
+ <para>
+ The picture shows the script explorer of TextMarker. Every
+ TextMarker project contains a folder called "test". This folder is
+ the default location for the test files. In this folder, each
+ script file has its own sub-folder with a relative path equal to
+ the script's package path in the "script" folder. This sub-folder
+ contains the test files. In every script's test folder you will
+ also find a result folder with the results of the tests. Should
+ you use test files from another location in the file system, the
+ results are saved in the "temp" sub-folder of the project's
+ "test" folder. All files in the "temp" folder are deleted
+ once Eclipse is closed.
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="80" format="PNG"
+ fileref="&imgroot;folder_struc_sep_desc_cut.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Script Explorer with the test folder expanded.</phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+ </section>
+ </section>
+ <section id="ugr.tools.tm.testing.usage">
+ <title>Usage</title>
+ <para>
+ This section will demonstrate how to use the testing
+ environment.
+ It will show the basic actions needed to perform a test
+ run.
+ </para>
+ <para>
+ Preparing Eclipse:
+ The testing environment provides its own
+ perspective called
+ "TextMarker Testing". It will display the main
+ view as well as the
+ different result views on the right hand side.
+ It is encouraged to
+ use this perspective, especially when working
+ with the testing
+ environment for the first time.
+ </para>
+ <para>
+ Selecting a script for testing:
+ TextMarker always tests the script that is currently open in the
+ script editor. Should another editor be open, for example a
+ Java editor displaying a Java class, the testing view is not
+ available.
+ </para>
+ <para>
+ Creating a test file:
+ A test file is a previously annotated .xmi file that can be used as
+ a gold standard for the test. No additional tools are provided for
+ creating such a file, since the TextMarker system already provides
+ such tools.
+ </para>
+ <para>
+ Selecting a test file:
+ Test files can be added to the test list by simply dragging them
+ from the Script Explorer into the test-file list. Depending on the
+ setting in the preference page, test files from a script's "test"
+ folder might already be loaded into the list. A different way to
+ add test files is to use the "Add files from folder" button. It can
+ be used to add all .xmi files from a selected folder. The "del" key
+ can be used to remove files from the test list.
+ </para>
+ <para>
+ Selecting a CAS View to test:
+ TextMarker supports different views that allow you to operate on
+ different levels in a document. The InitialView is selected by
+ default. However, you can also switch the evaluation to another
+ view by typing the view's name into the list or by selecting the
+ view you wish to use from the list.
+ </para>
+ <para>
+ Selecting the evaluator:
+ The testing environment supports different evaluators that allow a
+ sophisticated analysis of the behavior of a TextMarker script. The
+ evaluator can be chosen in the preference page of the testing
+ environment. The preference page can be opened either through the
+ menu or by clicking the blue preference button in the toolbar of
+ the testing view. The default evaluator is the
+ "Exact CAS Evaluator", which compares the offsets of the
+ annotations between the test file and the file annotated by the
+ tested script.
+ </para>
+ <para>
+ Excluding Types:
+ During a test run it might be convenient to disable testing for
+ specific types like punctuation or tags. The "exclude types" button
+ opens a dialog where all types can be selected that should not be
+ considered in the test.
+ </para>
+ <para>
+ Running the test:
+ A test-run can be started by clicking on the
+ green start button in
+ the toolbar.
+ </para>
+ <para>
+ Result Overview:
+ After every test run, the testing main view displays some
+ information on how well the script performed. It displays the
+ overall number of true positive, false positive and false negative
+ annotations of all result files as well as an overall F1 score.
+ Furthermore, a table is displayed that contains the overall
+ statistics of the selected test file as well as statistics for
+ every single type in the test file. The information displayed
+ comprises true positives, false positives, false negatives,
+ precision, recall and F1 score.
+ </para>
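+ <para>
+ The relationship between the counts and the derived measures can
+ be sketched as follows. This is only an illustration in Python
+ using the standard definitions of precision, recall and F1. The
+ function name is hypothetical, and returning 0.0 for an empty
+ denominator is an assumption rather than documented behavior of
+ the workbench.
+ </para>
+ <programlisting><![CDATA[
```python
def evaluation_scores(tp, fp, fn):
    """Return (precision, recall, f1) for the given annotation counts."""
    # Precision: share of the script's annotations that are correct.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: share of the gold standard annotations that were found.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of precision and recall.
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Hypothetical counts: 8 true positives, 2 false positives, 2 false negatives.
precision, recall, f1 = evaluation_scores(8, 2, 2)
```
+ ]]></programlisting>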
+ <para>
+ The testing environment also supports the export of the
+ overall data
+ in the form of a comma-separated table. Clicking "export
+ evaluation
+ data" opens a dialog window that contains this table.
+ The text
+ in this table can be copied and easily imported into
+ OpenOffice.org
+ or MS Excel.
+ </para>
+ <para>
+ Result Files:
+ When running a test, the evaluator creates a new
+ result .xmi file
+ and adds new true positive, false positive and
+ false negative
+ annotations. By clicking on a file in the test-file
+ list, you can
+ open the corresponding result .xmi file in the
+ TextMarker script
+ editor. When a result file is opened in the script
+ explorer,
+ additional views open that allow easy access to and
+ browsing of
+ the additional debugging annotations.
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="80" format="PNG"
+ fileref="&imgroot;Screenshot_Result_TP_desc_close_cut.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Open result file and selected true positive annotation
+ in the true positive view.
+ </phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+ </para>
+ </section>
+ <section id="ugr.tools.tm.testing.evaluators">
+ <title>Evaluators</title>
+ <para>
+ When testing a CAS file, the system compares the offsets of
+ the
+ annotations of a previously annotated gold standard file with
+ the
+ offsets of the annotations
+ of the result file the script
+ produced. Evaluators are responsible for comparing
+ the annotations in the two CAS files.
+ They implement
+ different methods
+ and
+ strategies for comparing the annotations. In addition, an
+ extension point is provided that allows easy implementation of new
+ evaluators.
+ </para>
+ <para>
+ Exact Match Evaluator:
+ The Exact Match Evaluator compares the
+ offsets of the annotations in
+ the result and the gold standard
+ file. Any difference will be
+ marked with either a false positive or
+ a false negative annotation.
+ </para>
+ <para>
+ Partial Match Evaluator:
+ The Partial Match Evaluator compares
+ the offsets of the annotations in
+ the result and the gold standard
+ file. It allows differences at
+ the beginning or the end of an
+ annotation. For example, "corresponding" and "corresponding " will
+ not be
+ annotated as an error.
+ </para>
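+ <para>
+ The difference between exact and partial matching can be
+ illustrated with a small Python sketch. It is not the workbench's
+ implementation: annotations are modeled as hypothetical (begin,
+ end, type) tuples, and partial matching is assumed here to mean
+ that two annotations of the same type overlap.
+ </para>
+ <programlisting><![CDATA[
```python
def exact_match(gold, result):
    """Classify (begin, end, type) tuples by exact offset comparison."""
    gold_set, result_set = set(gold), set(result)
    return {
        "tp": sorted(result_set & gold_set),  # identical in both files
        "fp": sorted(result_set - gold_set),  # only created by the script
        "fn": sorted(gold_set - result_set),  # only in the gold standard
    }


def overlaps(a, b):
    """True if two annotations share a type and at least one character."""
    return a[2] == b[2] and a[0] < b[1] and b[0] < a[1]


def partial_match(gold, result):
    """Like exact_match, but tolerate differing begin or end offsets."""
    tp = [r for r in result if any(overlaps(g, r) for g in gold)]
    fp = [r for r in result if r not in tp]
    fn = [g for g in gold if not any(overlaps(g, r) for r in result)]
    return {"tp": tp, "fp": fp, "fn": fn}


# "corresponding" (13 characters) versus "corresponding " (14 characters).
gold = [(0, 13, "Word")]
result = [(0, 14, "Word")]
exact = exact_match(gold, result)
partial = partial_match(gold, result)
```
+ ]]></programlisting>
+ <para>
+ In this sketch, the exact comparison reports one false positive
+ and one false negative for the trailing space, while the partial
+ comparison accepts the annotation as a true positive.
+ </para>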
+ <para>
+ Core Match Evaluator:
+ The Core Match Evaluator accepts
+ annotations that share a core
+ expression. In this context, a core
+ expression is at least four
+ characters long and starts with a
+ capital letter. For example, the
+ two annotations "L404-123-421"
+ and "L404-321-412" would be
+ considered a true positive match,
+ because "L404" is a
+ core expression that is contained
+ in both annotations.
+ </para>
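+ <para>
+ One possible reading of this rule is sketched below in Python.
+ The regular expression and the function names are assumptions
+ made for illustration: a core expression is taken to be a
+ capital letter followed by at least three further letters or
+ digits. The evaluator's actual definition may differ.
+ </para>
+ <programlisting><![CDATA[
```python
import re

# Assumed pattern: one capital letter plus three or more letters or digits.
CORE_EXPRESSION = re.compile(r"[A-Z][A-Za-z0-9]{3,}")


def core_expressions(text):
    """Return the set of core expressions found in the covered text."""
    return set(CORE_EXPRESSION.findall(text))


def core_match(gold_text, result_text):
    """True if both annotation texts share at least one core expression."""
    return bool(core_expressions(gold_text) & core_expressions(result_text))


shared = core_match("L404-123-421", "L404-321-412")
```
+ ]]></programlisting>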
+ <para>
+ Word Accuracy Evaluator:
+ Compares the labels of all
+ words/numbers in an annotation, where the
+ label is the type of
+ the annotation. As a consequence,
+ each
+ word or number that is not part of the
+ annotation is counted as a
+ single false negative. For example, take
+ the sentence "Christmas
+ is on the 24.12 every year."
+ The script labels "Christmas is on the
+ 24" as a single sentence, while
+ the test file labels the sentence
+ correctly with a single sentence
+ annotation. While the
+ Exact CAS Evaluator, for example, will only
+ assign a single false negative
+ annotation, the Word Accuracy Evaluator
+ will mark every word or number
+ as a separate false negative.
+ </para>
+ <para>
+ Template Only Evaluator:
+ This evaluator compares the offsets of
+ the annotations and the
+ features that have been created by the
+ script. For example, the
+ text "Alan Mathison Turing" is marked with
+ the author annotation,
+ and "author" contains two features: "FirstName"
+ and "LastName". If
+ the script now creates an author annotation with
+ only one feature,
+ the annotation will be marked as a false positive.
+ </para>
+ <para>
+ Template on Word Level Evaluator:
+ The Template on Word Level
+ Evaluator compares the offsets of the
+ annotations. In addition, it
+ also compares the features and feature
+ structures and the values
+ stored in the features. For example, the
+ annotation "author" might
+ have features like "FirstName" and
+ "LastName". The author's name is
+ "Alan Mathison Turing" and the
+ script correctly assigns the author
+ annotation. The features
+ assigned by the script are "FirstName:
+ Alan" and "LastName:
+ Mathison", while the correct feature values would
+ be "FirstName:
+ Alan" and "LastName: Turing". In this case the Template
+ on Word Level Evaluator
+ will mark the annotation as a false positive, since the
+ feature
+ values differ.
+ </para>
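+ <para>
+ The two template-based comparisons can be contrasted with a short
+ Python sketch. The dictionary layout and the function names are
+ hypothetical and only mirror the description above: the first
+ check looks at offsets and feature names, the second one
+ additionally compares the feature values.
+ </para>
+ <programlisting><![CDATA[
```python
def same_template(gold, result):
    """Compare the offsets and the names of the features that are set."""
    same_offsets = (gold["begin"], gold["end"]) == (result["begin"], result["end"])
    return same_offsets and set(gold["features"]) == set(result["features"])


def same_template_and_values(gold, result):
    """Additionally require that every feature holds the same value."""
    return same_template(gold, result) and gold["features"] == result["features"]


# Gold standard annotation covering "Alan Mathison Turing" (20 characters).
gold = {"begin": 0, "end": 20,
        "features": {"FirstName": "Alan", "LastName": "Turing"}}
# The script found the right span but stored a wrong last name.
wrong_value = {"begin": 0, "end": 20,
               "features": {"FirstName": "Alan", "LastName": "Mathison"}}
# The script found the right span but created only one feature.
missing_feature = {"begin": 0, "end": 20,
                   "features": {"FirstName": "Alan"}}
```
+ ]]></programlisting>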
+ </section>
+
+ </section>
+ <section id="ugr.tools.tm.textruler">
+ <title>TextRuler</title>
+ <para>
+ Using the knowledge engineering approach, a knowledge engineer
+ normally
+ writes handcrafted rules to create a domain-dependent
+ information
+ extraction application, often supported by a gold
+ standard. When
+ starting the engineering process for the acquisition
+ of the
+ extraction knowledge for a possibly new slot or, more generally, for
+ new
+ concepts, machine learning methods are often able to offer
+ support
+ in an iterative engineering process. This section gives a
+ conceptual
+ overview of the process model for the semi-automatic
+ development of
+ rule-based information extraction applications.
+ </para>
+ <para>
+ First, a suitable set of documents that contain the text
+ fragments with
+ interesting patterns needs to be selected and
+ annotated with the
+ target concepts. Then, the knowledge engineer
+ chooses and configures
+ the methods for automatic rule acquisition to
+ the best of their
+ knowledge for the learning task: Lambda expressions
+ based on tokens
+ and linguistic features, for example, differ in their
+ application
+ domain from wrappers that process generated HTML pages.
+ </para>
+ <para>
+ Furthermore, parameters like the window size defining relevant
+ features need to
+ be set to an appropriate level. Before the annotated
+ training
+ documents form the input of the learning task, they are
+ enriched
+ with features generated by the partial rule set of the
+ developed
+ application. The results of the methods, that is, the learned
+ rules,
+ are proposed to the knowledge engineer for the extraction of
+ the
+ target concept.
+ </para>
+ <para>
+ The knowledge engineer has different options to proceed: If the
+ quality, amount or generality of the presented rules is not
+ sufficient, then additional training documents need to be annotated
+ or additional rules have to be handcrafted to provide more features
+ in general or more appropriate features. Rules or rule sets of high
+ quality can be modified, combined or generalized and transferred to
+ the rule set of the application in order to support the extraction
+ task of the target concept. In the case that the methods did not
+ learn reasonable rules at all, the knowledge engineer proceeds with
+ writing handcrafted rules.
+ </para>
+ <para>
+ Having gathered enough extraction knowledge for the current
+ concept, the
+ semi-automatic process is iterated and the focus is
+ moved to the
+ next concept until the development of the application is
+ completed.
+ </para>
+ <section id="ugr.tools.tm.textruler.learner">
+ <title>Available Learners</title>
+ <para>
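+ Overview
+ <informaltable frame="all">
+ <tgroup cols="5">
+ <thead>
+ <row>
+ <entry>Name</entry>
+ <entry>Strategy</entry>
+ <entry>Document</entry>
+ <entry>Slots</entry>
+ <entry>Status</entry>
+ </row>
+ </thead>
+ <tbody>
+ <row>
+ <entry>BWI (1)</entry>
+ <entry>Boosting, Top Down</entry>
+ <entry>Struct, Semi</entry>
+ <entry>Single, Boundary</entry>
+ <entry>Planning</entry>
+ </row>
+ <row>
+ <entry>LP2 (2)</entry>
+ <entry>Bottom Up Cover</entry>
+ <entry>All</entry>
+ <entry>Single, Boundary</entry>
+ <entry>Prototype</entry>
+ </row>
+ <row>
+ <entry>RAPIER (3)</entry>
+ <entry>Top Down/Bottom Up Compr.</entry>
+ <entry>Semi</entry>
+ <entry>Single</entry>
+ <entry>Experimental</entry>
+ </row>
+ <row>
+ <entry>WHISK (4)</entry>
+ <entry>Top Down Cover</entry>
+ <entry>All</entry>
+ <entry>Multi</entry>
+ <entry>Prototype</entry>
+ </row>
+ <row>
+ <entry>WIEN (5)</entry>
+ <entry>CSP</entry>
+ <entry>Struct</entry>
+ <entry>Multi, Rows</entry>
+ <entry>Prototype</entry>
+ </row>
+ </tbody>
+ </tgroup>
+ </informaltable>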
+ </para>
+ <para>
+ <itemizedlist>
+ <listitem>
+ <para>
+ Strategy: The strategies used by the learning methods are
+ commonly covering algorithms.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Document: The type of document may be "free", as in
+ newspapers, or "semi" or "struct", as in HTML pages.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Slots: The slots refer to a single annotation that
+ represents the goal of the learning task. Some rules are able to
+ create several annotations at once in the same context
+ (multi-slot). However, only single slots are supported by the
+ current implementations.
+ </para>
+ </listitem>
+ <listitem>
+ <para>
+ Status: The current status of the implementation in the
+ TextRuler framework.
+ </para>
+ </listitem>
+ </itemizedlist>
+ </para>
+ <para>
+ Publications
+ </para>
+ <para>
+ (1) Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper
+ Induction.
+ In AAAI/IAAI, pages 577–583, 2000.
+ </para>
+ <para>
+ (2) F. Ciravegna. (LP)2, Rule Induction for Information
+ Extraction
+ Using Linguistic Constraints. Technical Report CS-03-07,
+ Department
+ of Computer Science, University of Sheffield, Sheffield,
+ 2003.
+ </para>
+ <para>
+ (3) Mary Elaine Califf and Raymond J. Mooney. Bottom-up
+ Relational
+ Learning of Pattern Matching Rules for Information
+ Extraction.
+ Journal of Machine Learning Research, 4:177–210, 2003.
+ </para>
+ <para>
+ (4) Stephen Soderland, Claire Cardie, and Raymond Mooney.
+ Learning
+ Information Extraction Rules for Semi-Structured and Free
+ Text. In
+ Machine Learning, volume 34, pages 233–272, 1999.
+ </para>
+ <para>
+ (5) N. Kushmerick, D. Weld, and B. Doorenbos. Wrapper
+ Induction for
+ Information Extraction. In Proc. International Joint Conference on
+ Artificial Intelligence (IJCAI), 1997.
+ </para>
+ <para>
+ BWI
+ BWI (Boosted Wrapper Induction) uses boosting techniques to
+ improve
+ the performance of simple pattern-matching single-slot
+ boundary
+ wrappers (boundary detectors). Two sets of detectors are
+ learned:
+ the "fore" and the "aft" detectors. Weighted by their
+ confidences
+ and combined with a slot length histogram derived from
+ the training
+ data, they can classify a given pair of boundaries
+ within a
+ document. BWI can be used for structured, semi-structured
+ and free
+ text. The patterns are token-based, with special wildcards
+ for more
+ general rules.
+ </para>
+ <para>
+ Implementations
+ No implementations are yet available.
+ </para>
+ <para>
+ Parameters
+ No parameters are yet available.
+
+ </para>
+ <para>
+ LP2
+ This method operates on all three kinds of documents. It
+ learns
+ separate rules for the beginning and the end of a single
+ slot. So-called
+ tagging rules insert boundary SGML tags, and
+ additionally
+ induced correction rules shift misplaced tags to their
+ correct
+ positions in order to improve precision. The learning
+ strategy is a
+ bottom-up covering algorithm. It starts by creating a
+ specific seed
+ instance with a window of w tokens to the left and
+ right of the
+ target boundary and searches for the best
+ generalization. Other
+ linguistic NLP features can be used in order
+ to generalize over the
+ flat word sequence.
+ </para>
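+ <para>
+ The seed instance described above can be pictured with a minimal
+ Python sketch. It only shows how a window of w tokens around a
+ boundary could be collected. The function name, the token list
+ and the boundary index are hypothetical, and the subsequent
+ generalization step of the learner is not shown.
+ </para>
+ <programlisting><![CDATA[
```python
def seed_window(tokens, boundary, w):
    """Return the w tokens to the left and right of a boundary position.

    The boundary is the index of the first token after the tag position.
    """
    left = tokens[max(0, boundary - w):boundary]
    right = tokens[boundary:boundary + w]
    return left, right


tokens = "The seminar starts at 4 pm in room 12".split()
# Hypothetical start boundary of a time slot, located right before "4".
left_context, right_context = seed_window(tokens, 4, 2)
```
+ ]]></programlisting>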
+ <para>
+ Implementations
+ LP2 (naive):
+ LP2 (optimized):
+ </para>
+ <para>
+ Parameters
+ Context Window Size (to the left and right):
+ Best
+ Rules List Size:
+ Minimum Covered Positives per Rule:
+ Maximum Error
+ Threshold:
+ Contextual Rules List Size:
+ </para>
+ <para>
+ RAPIER
+ RAPIER induces single-slot extraction rules for
+ semi-structured
+ documents. The rules consist of three patterns: a
+ pre-filler, a
+ filler and a post-filler pattern. Each can hold
+ several constraints
+ on tokens and their associated POS tag and
+ semantic information.
+ The algorithm uses a bottom-up compression
+ strategy, starting with
+ a most specific seed rule for each training
+ instance. This initial
+ rule base is compressed by randomly selecting
+ rule pairs and searching
+ for the best generalization. Considering
+ two
+ rules, the least general generalization (LGG) of the slot fillers
+ is created and specialized by adding rule items to the pre- and
+ post-filler until the new rules operate well on the training set.
+ The best of the k rules (k-beam search) is added to the rule base
+ and all empirically subsumed rules are removed.
+ </para>
+ <para>
+ Implementations
+ RAPIER:
+ </para>
+ <para>
+ Parameters
+ Maximum Compression Fail Count:
+ Internal Rules List
+ Size:
+ Rule Pairs for Generalizing:
+ Maximum 'No improvement' Count:
+ Maximum Noise Threshold:
+ Minimum Covered Positives Per Rule:
+ PosTag
+ Root Type:
+ Use All 3 GenSets at Specialization:
+ </para>
+ <para>
+ WHISK
+ WHISK is a multi-slot method that operates on all three
+ kinds of
+ documents and learns single- or multi-slot rules that look
+ similar to
+ regular expressions. The top-down covering algorithm
+ begins with
+ the most general rule and specializes it by adding
+ single
+ rule terms until the rule makes no errors on the training
+ set. Domain-specific
+ classes or linguistic information obtained by a
+ syntactic
+ analyzer can be used as additional features. The exact
+ definition
+ of a rule term (e.g. a token) and of a problem instance
+ (e.g. a
+ whole document or a single sentence) depends on the
+ operating
+ domain and document
+ type.
+ </para>
+ <para>
+ Implementations
+ WHISK (token):
+ WHISK (generic):
+ </para>
+ <para>
+ Parameters
+ Window Size:
+ Maximum Error Threshold:
+ PosTag Root
+ Type:
+ </para>
+ <para>
+ WIEN
+ WIEN is the only method listed here that operates on
+ highly structured
+ texts only. It induces so-called wrappers that
+ anchor the slots by
+ the structured context around them. The HLRT
+ (head left right
+ tail) wrapper class, for example, can determine and
+ extract
+ several multi-slot templates by first separating the
+ important information
+ block from unimportant head and tail portions
+ and then extracting
+ multiple data rows from table-like
+ data
+ structures in the remaining document. Inducing a wrapper is done
+ by solving a constraint satisfaction problem (CSP) for all possible
+ pattern combinations from the training data.
+ </para>
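+ <para>
+ How an already induced HLRT wrapper is applied to a page can be
+ sketched in a few lines of Python. This is an illustration of the
+ head/left/right/tail idea only, not the TextRuler implementation,
+ and it does not cover the induction of the delimiters. The
+ function name, the sample page and the delimiters are
+ hypothetical.
+ </para>
+ <programlisting><![CDATA[
```python
def exec_hlrt(page, head, tail, delimiters):
    """Apply an HLRT wrapper; delimiters is a list of (left, right) pairs."""
    rows = []
    pos = page.find(head)
    if pos == -1:
        return rows
    pos += len(head)  # skip the unimportant head portion
    first_left = delimiters[0][0]
    while True:
        next_left = page.find(first_left, pos)
        next_tail = page.find(tail, pos)
        # Stop when no further row starts before the tail portion begins.
        if next_left == -1 or (next_tail != -1 and next_tail < next_left):
            break
        row = []
        for left, right in delimiters:
            start = page.find(left, pos)
            if start == -1:
                return rows
            start += len(left)
            end = page.find(right, start)
            if end == -1:
                return rows
            row.append(page[start:end])  # slot value between the delimiters
            pos = end + len(right)
        rows.append(row)
    return rows


page = ("<HTML><TITLE>Codes</TITLE><BODY><B>Country Codes</B><P>"
        "<B>Congo</B> <I>242</I><BR>"
        "<B>Egypt</B> <I>20</I><BR>"
        "<HR><B>End</B></BODY></HTML>")
rows = exec_hlrt(page, "<P>", "<HR>", [("<B>", "</B>"), ("<I>", "</I>")])
```
+ ]]></programlisting>
+ <para>
+ In this sketch, the head delimiter skips the bold page title and
+ the tail delimiter prevents the bold "End" marker from being
+ extracted as a further row.
+ </para>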
+ <para>
+ Implementations
+ WIEN:
+ </para>
+ <para>
+ Parameters
+ No parameters are available.
+
+ </para>
+ </section>
+ </section>
+
+</chapter>
\ No newline at end of file