Posted to commits@uima.apache.org by pk...@apache.org on 2013/06/07 19:11:08 UTC
svn commit: r1490733 - in /uima/sandbox/ruta/trunk/ruta-docbook/src/docbook:
./ images/tools/ruta/workbench/textruler/
Author: pkluegl
Date: Fri Jun 7 17:11:07 2013
New Revision: 1490733
URL: http://svn.apache.org/r1490733
Log:
UIMA-2777
- started to rewrite textruler section in documentation
Added:
uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/
uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png (with props)
uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png (with props)
Modified:
uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml
Added: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png
URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png?rev=1490733&view=auto
==============================================================================
Binary file - no diff available.
Propchange: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png
URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png?rev=1490733&view=auto
==============================================================================
Binary file - no diff available.
Propchange: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Modified: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml?rev=1490733&r1=1490732&r2=1490733&view=diff
==============================================================================
--- uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml (original)
+++ uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml Fri Jun 7 17:11:07 2013
@@ -24,199 +24,60 @@ specific language governing permissions
under the License.
-->
-<section id="section.ugr.tools.ruta.workbench.textruler">
+<section id="section.tools.ruta.workbench.textruler">
<title>TextRuler</title>
- <para> Using the knowledge engineering approach, a knowledge engineer normally writes handcrafted
- rules to create a domain dependent information extraction application often supported by a gold
- standard. When starting the engineering process for the acquisition of the extraction knowledge
- for possible new slot or more generally for new concepts, machine learning methods are often able
- to offer support in an iterative engineering process. This section gives a conceptual overview
- of the process model for the semi-automatic development of rule-based information extraction
- applications.
+ <para>
+ Apache UIMA Ruta TextRuler is a framework for supervised rule induction included in the UIMA Ruta Workbench.
+ It provides several configurable algorithms, which are able to learn new rules based on given labeled data.
+ The framework was created in order to support the user by suggesting new rules for the given task.
+ The user selects a suitable learning algorithm and adapts its configuration parameters. Furthermore,
+ the user engineers a set of annotation-based features, which enable the algorithms to form efficient, effective and comprehensive rules.
+ The rule learning algorithms present their suggested rules in a new view, from which the user can copy either
+ the complete script or single rules to a new script file, where the rules can be further refined.
</para>
- <para> First, a suitable set of documents that contains the text fragments with patterns needs to be selected and annotated with the target concepts. Then, the knowledge
- engineer chooses and configures the methods for automatic rule acquisition to the best of his
- knowledge for the learning task: Lambda expressions based on tokens and linguistic features, for
- example, differ in their application domain from wrappers that process generated HTML pages.
+ <para>
+ This section gives a short introduction to the included features and learners, and explains how to use the framework to learn UIMA Ruta rules. First, the
+ available rule learning algorithms are introduced in <xref linkend="section.tools.ruta.workbench.textruler.learner"/>. Then,
+ the user interface and its usage are explained in <xref linkend="section.tools.ruta.workbench.textruler.ui"/> using an exemplary UIMA Ruta project.
</para>
- <para> Furthermore, parameters like the window size defining relevant features need to be set to
- an appropriate level. Before the annotated training documents form the input of the learning
- task, they are enriched with features generated by the partial rule set of the developed
- application. The result of the methods, which are the learned rules, are proposed to the knowledge
- engineer for the extraction of the target concept.
- </para>
- <para> The knowledge engineer has different options to proceed: If the quality, amount or
- generality of the presented rules is not sufficient, then additional training documents need to
- be annotated or additional rules have to be handcrafted to provide more features in general or
- more appropriate features. Rules or rule sets of high quality can be modified, combined or
- generalized and transfered to the rule set of the application in order to support the extraction
- task of the target concept. In the case that the methods did not learn reasonable rules at all,
- the knowledge engineer proceeds with writing handcrafted rules.
- </para>
- <para> Having gathered enough extraction knowledge for the current concept, the semi-automatic
- process is iterated and the focus is moved to the next concept until the development of the
- application is completed.
- </para>
- <section id="ugr.tools.ruta.textruler.learner">
- <title>Available Learners</title>
- <para>
- The available learners are based on the following publications:
- <orderedlist numeration="arabic">
- <!--
- <listitem>
- <para> Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In AAAI/IAAI,
- pages 577-583, 2000.</para>
- </listitem>
- -->
- <listitem>
- <para> F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic
- Constraints. Technical Report CS-03-07, Department of Computer Science, University of
- Sheffield, Sheffield, 2003.</para>
- </listitem>
- <listitem>
- <para> Mary Elaine Califf and Raymond J. Mooney. Bottom-up Relational Learning of Pattern
- Matching Rules for Information Extraction. Journal of Machine Learning Research,
- 4:177-210, 2003.</para>
- </listitem>
- <listitem>
- <para> Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information
- Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34,
- pages 233-272, 1999.</para>
- </listitem>
- <listitem>
- <para> N. Kushmerick, D. Weld, and B. Doorenbos. Wrapper Induction for Information
- Extraction. In Proc. IJC Artificial Intelligence, 1997.</para>
- </listitem>
- </orderedlist>
- </para>
+ <section id="section.tools.ruta.workbench.textruler.learner">
+ <title>Included rule learning algorithms</title>
<para>
- Each available learner has several features. Their meaning is explained here:
- <itemizedlist>
- <listitem>
- <para> Strategy: The used strategy of the learning methods are commonly coverage
- algorithms.</para>
- </listitem>
- <listitem>
- <para>
- Document: The type of the document may be <quote>free</quote>
- like in newspapers, <quote>semi</quote>
- or <quote>struct</quote> like in HTML pages.
- </para>
- </listitem>
- <listitem>
- <para> Slots: The slots refer to a single annotation that represents the goal of the
- learning task. Some rule are able to create several annotations at once in the same
- context (multi-slot). However, only single slots are supported by the current
- implementations.</para>
- </listitem>
- <listitem>
- <para> Status: The current status of the implementation in the TextRuler framework.</para>
- </listitem>
- </itemizedlist>
- </para>
- <para>
- The following table gives an overview:
- <table id="table.ugr.tools.ruta.workbench.textruler.available_learners" frame="all">
- <title>Overview of available learners</title>
- <tgroup cols="6" colsep="1" rowsep="1">
- <colspec colname="c1" colwidth="1*" />
- <colspec colname="c2" colwidth="1*" />
- <colspec colname="c3" colwidth="1*" />
- <colspec colname="c4" colwidth="1*" />
- <colspec colname="c5" colwidth="1*" />
- <colspec colname="c6" colwidth="1*" />
- <thead>
- <row>
- <entry align="center">Name</entry>
- <entry align="center">Strategy</entry>
- <entry align="center">Document</entry>
- <entry align="center">Slots</entry>
- <entry align="center">Status</entry>
- <entry align="center">Publication</entry>
- </row>
- </thead>
- <tbody>
- <!--
- <row>
- <entry>BWI</entry>
- <entry>Boosting, Top Down</entry>
- <entry>Struct, Semi</entry>
- <entry>Single, Boundary</entry>
- <entry>Planning</entry>
- <entry>1</entry>
- </row>
- -->
- <row>
- <entry>LP2</entry>
- <entry>Bottom Up Cover</entry>
- <entry>All</entry>
- <entry>Single, Boundary</entry>
- <entry>Prototype</entry>
- <entry>1</entry>
- </row>
- <row>
- <entry>RAPIER</entry>
- <entry>Top Down/Bottom Up Compr.</entry>
- <entry>Semi</entry>
- <entry>Single</entry>
- <entry>Experimental</entry>
- <entry>2</entry>
- </row>
- <row>
- <entry>WHISK</entry>
- <entry>Top Down Cover</entry>
- <entry>All</entry>
- <entry>Multi</entry>
- <entry>Prototype</entry>
- <entry>3</entry>
- </row>
- <row>
- <entry>WIEN</entry>
- <entry>CSP</entry>
- <entry>Struct</entry>
- <entry>Multi, Rows</entry>
- <entry>Prototype</entry>
- <entry>4</entry>
- </row>
- </tbody>
- </tgroup>
- </table>
- </para>
- <!--
- <section id="section.ugr.tools.ruta.workbench.textruler.bwi">
- <title>BWI (Boosted Wrapper Induction)</title>
- <para> BWI uses boosting techniques to improve the performance of simple pattern matching
- single-slot boundary wrappers (boundary detectors). Two sets of detectors are learned: the
- "fore" and the "aft" detectors. Weighted by their confidences and combined with a slot
- length histogram derived from the training data they can classify a given pair of boundaries
- within a document. BWI can be used for structured, semi-structured and free text. The
- patterns are token-based with special wildcards for more general rules. </para>
- <para> Implementations No implementations are yet available. </para>
- <para> Parameters No parameters are yet available. </para>
- </section>
- -->
- <section id="section.ugr.tools.ruta.workbench.textruler.lp2">
+ This section gives a short description of the rule learning algorithms,
+ which are provided in the UIMA Ruta TextRuler framework.
+ </para>
+ <section id="section.tools.ruta.workbench.textruler.lp2">
<title>LP2</title>
- <para>This method operates on all three kinds of documents. It learns separate rules for
- the beginning and the end of a single slot. Tagging rules insert boundary SGML
- tags and, additionally, induced correction rules shift misplaced tags to their correct
- positions in order to improve precision. The learning strategy is a bottom-up covering
+ <note>
+ <para>
+ This rule learner is an experimental implementation of the ideas and algorithms published in:
+ F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic
+ Constraints. Technical Report CS-03-07, Department of Computer Science, University of
+ Sheffield, Sheffield, 2003.
+ </para>
+ </note>
+ <para>This algorithm learns separate rules for
+ the beginning and the end of a single slot, which are later combined
+ in order to identify the targeted annotation. The learning strategy is a bottom-up covering
algorithm. It starts by creating a specific seed instance with a window of w tokens to the
- left and right of the target boundary and searches for the best generalization. Other
- linguistic NLP-features can be used in order to generalize over the flat word sequence.
+ left and right of the target boundary and searches for the best generalization. Additional context rules are
+ induced in order to identify missing boundaries. The current implementation does not support correction rules.
+ The TextRuler framework provides two versions of this algorithm: LP2 (naive) is a straightforward implementation
+ with limited expressiveness concerning the resulting Ruta rules. LP2 (optimized) is an improved
+ version that uses a dynamic programming approach and generally provides better results.
+ The following parameters are available. For a more detailed description of the parameters,
+ please refer to the implementation and the publication.
</para>
<para>
- Parameters:
- </para>
<itemizedlist>
<listitem>
<para>Context Window Size (to the left and right)</para>
</listitem>
<listitem>
- <para>Best Rules List Size: Minimum</para>
+ <para>Best Rules List Size</para>
</listitem>
<listitem>
- <para>Covered Positives per Rule</para>
+ <para>Minimum Covered Positives per Rule</para>
</listitem>
<listitem>
<para>Maximum Error Threshold</para>
@@ -225,55 +86,28 @@ under the License.
<para>Contextual Rules List Size</para>
</listitem>
</itemizedlist>
- </section>
- <section id="section.ugr.tools.ruta.workbench.textruler.rapier">
- <title>RAPIER</title>
- <para>RAPIER induces single slot extraction rules for semi-structured documents. The rules
- consist of three patterns: a pre-filler, a filler and a post-filler pattern. Each pattern can hold
- several constraints on tokens and their according POS-tag- and semantic information. The
- algorithm uses a bottom-up compression strategy starting with a most specific seed rule for
- each training instance. This initial rule base is compressed by randomly selecting rule
- pairs and search for the best generalization. Considering two rules, the least general
- generalization (LGG) of the slot fillers are created and specialized by adding rule items to
- the pre- and post-filler until the new rules operate well on the training set. The best of
- the k rules (k-beam search) is added to the rule base and all empirically subsumed rules are
- removed. </para>
- <para>
- Parameters:
</para>
- <itemizedlist>
- <listitem>
- <para>Parameters Maximum Compression Fail Count</para>
- </listitem>
- <listitem>
- <para>Internal Rules List Size: Rule Pairs for Generalizing</para>
- </listitem>
- <listitem>
- <para>Maximum 'No improvement' Count</para>
- </listitem>
- <listitem>
- <para>Maximum Noise Threshold: Minimum Covered Positives Per Rule</para>
- </listitem>
- <listitem>
- <para>PosTag Root Type</para>
- </listitem>
- <listitem>
- <para>Use All 3 GenSets at Specialization</para>
- </listitem>
- </itemizedlist>
</section>
- <section id="section.ugr.tools.ruta.workbench.textruler.whisk">
+
+ <section id="section.tools.ruta.workbench.textruler.whisk">
<title>WHISK</title>
- <para> WHISK is a multi-slot method that operates on all three kinds of documents and learns
- single- or multi-slot rules looking similar to regular expressions. The top-down covering
- algorithm begins with the most general rule and specializes it by adding single rule terms
- until the rule does not make errors anymore on the training set. Domain specific classes or linguistic
- information obtained by a syntactic analyzer can be used as additional features. The exact
- definition of a rule term (e.g., a token) and of a problem instance (e.g., a whole document or
- a single sentence) depends on the operating domain and document type. </para>
+ <note>
<para>
- Parameters:
+ This rule learner is an experimental implementation of the ideas and algorithms published in:
+ Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information
+ Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34,
+ pages 233-272, 1999.
</para>
+ </note>
+ <para>WHISK is a multi-slot method that operates on all three kinds of documents and learns
+ single- or multi-slot rules looking similar to regular expressions. However, the current implementation supports only single-slot rules.
+ The top-down covering algorithm begins with the most general rule and specializes it by adding single rule terms
+ until the rule no longer makes errors on the training set. The TextRuler framework provides two versions of this algorithm:
+ WHISK (token) is a naive token-based implementation. WHISK (generic) is an optimized and improved implementation,
+ which is able to refer to arbitrary annotations and also supports primitive features. The following parameters are available. For a more detailed description of the parameters,
+ please refer to the implementation and the publication.
+ </para>
+ <para>
<itemizedlist>
<listitem>
<para>Parameters Window Size</para>
@@ -284,17 +118,51 @@ under the License.
<listitem>
<para>PosTag Root Type</para>
</listitem>
+ <listitem>
+ <para>Considered Features (comma-separated) - only WHISK (generic)</para>
+ </listitem>
</itemizedlist>
- </section>
- <section id="section.ugr.tools.ruta.workbench.textruler.wien">
- <title>WIEN </title>
- <para> WIEN is the only method listed here that operates on highly structured texts only. It
- induces wrappers that anchor the slots by their structured context.
- The HLRT (head left right tail) wrapper class for example can determine and extract several
- multi-slot-templates by first separating the important information block from unimportant
- head and tail portions and extracting multiple data rows from table like data
- structures from the remaining document. Inducing a wrapper is done by solving a CSP for all
- possible pattern combinations from the training data. </para>
- </section>
- </section>
-</section>
\ No newline at end of file
+ </para>
+ </section>
+ </section>
+ <section id="section.tools.ruta.workbench.textruler.ui">
+ <title>The TextRuler view</title>
+ <para>
+ </para>
+ <figure id="figure.tools.ruta.workbench.textruler.main">
+ <title>The UIMA Ruta TextRuler framework
+ </title>
+ <mediaobject>
+ <imageobject role="html">
+ <imagedata width="776px" format="PNG" align="center"
+ fileref="&imgroot;textruler/textruler.png" />
+ </imageobject>
+ <imageobject role="fo">
+ <imagedata width="5.4in" format="PNG" align="center"
+ fileref="&imgroot;textruler/textruler.png" />
+ </imageobject>
+ <textobject>
+ <phrase>UIMA Ruta TextRuler framework</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+ <figure id="figure.tools.ruta.workbench.textruler.pref">
+ <title>The UIMA Ruta TextRuler Preferences
+ </title>
+ <mediaobject>
+ <imageobject role="html">
+ <imagedata width="576px" format="PNG" align="center"
+ fileref="&imgroot;textruler/textruler_pref.png" />
+ </imageobject>
+ <imageobject role="fo">
+ <imagedata width="3.3in" format="PNG" align="center"
+ fileref="&imgroot;textruler/textruler_pref.png" />
+ </imageobject>
+ <textobject>
+ <phrase>UIMA Ruta TextRuler Preferences</phrase>
+ </textobject>
+ </mediaobject>
+ </figure>
+
+ </section>
+</section>