Posted to commits@uima.apache.org by pk...@apache.org on 2013/06/07 19:11:08 UTC
svn commit: r1490733 - in /uima/sandbox/ruta/trunk/ruta-docbook/src/docbook: ./ images/tools/ruta/workbench/textruler/

Author: pkluegl
Date: Fri Jun  7 17:11:07 2013
New Revision: 1490733

URL: http://svn.apache.org/r1490733
Log:
UIMA-2777
- started to rewrite textruler section in documentation

Added:
    uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/
    uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png   (with props)
    uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png   (with props)
Modified:
    uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml

Added: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png
URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png?rev=1490733&view=auto
==============================================================================
Binary file - no diff available.

Propchange: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png
URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png?rev=1490733&view=auto
==============================================================================
Binary file - no diff available.

Propchange: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/images/tools/ruta/workbench/textruler/textruler_pref.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml?rev=1490733&r1=1490732&r2=1490733&view=diff
==============================================================================
--- uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml (original)
+++ uima/sandbox/ruta/trunk/ruta-docbook/src/docbook/tools.ruta.workbench.textruler.xml Fri Jun  7 17:11:07 2013
@@ -24,199 +24,60 @@ specific language governing permissions 
 under the License.
 -->
 
-<section id="section.ugr.tools.ruta.workbench.textruler">
+<section id="section.tools.ruta.workbench.textruler">
   <title>TextRuler</title>
-  <para> Using the knowledge engineering approach, a knowledge engineer normally writes handcrafted
-    rules to create a domain dependent information extraction application often supported by a gold
-    standard. When starting the engineering process for the acquisition of the extraction knowledge
-    for possible new slot or more generally for new concepts, machine learning methods are often able
-    to offer support in an iterative engineering process. This section gives a conceptual overview
-    of the process model for the semi-automatic development of rule-based information extraction
-    applications.
+  <para>
+    Apache UIMA Ruta TextRuler is a framework for supervised rule induction included in the UIMA Ruta Workbench. 
+    It provides several configurable algorithms, which are able to learn new rules based on given labeled data.
+    The framework was created in order to support the user by suggesting new rules for the given task. 
+    The user selects a suitable learning algorithm and adapts its configuration parameters. Furthermore, 
+    the user engineers a set of annotation-based features, which enable the algorithms to form efficient, effective and comprehensive rules.
+    The rule learning algorithms present their suggested rules in a new view, in which the user can either copy 
+    the complete script or single rules to a new script file, where the rules can be further refined.
   </para>
-  <para> First, a suitable set of documents that contains the text fragments with patterns needs to be selected and annotated with the target concepts. Then, the knowledge
-    engineer chooses and configures the methods for automatic rule acquisition to the best of his
-    knowledge for the learning task: Lambda expressions based on tokens and linguistic features, for
-    example, differ in their application domain from wrappers that process generated HTML pages.
+  <para>
+    This section gives a short introduction to the included features and learners and shows how to use the framework to learn UIMA Ruta rules. First, the 
+    available rule learning algorithms are introduced in <xref linkend="section.tools.ruta.workbench.textruler.learner"/>. Then, 
+    the user interface and its usage are explained in <xref linkend="section.tools.ruta.workbench.textruler.ui"/> using an exemplary UIMA Ruta project.
   </para>
-  <para> Furthermore, parameters like the window size defining relevant features need to be set to
-    an appropriate level. Before the annotated training documents form the input of the learning
-    task, they are enriched with features generated by the partial rule set of the developed
-    application. The result of the methods, which are the learned rules, are proposed to the knowledge
-    engineer for the extraction of the target concept.
-  </para>
-  <para> The knowledge engineer has different options to proceed: If the quality, amount or
-    generality of the presented rules is not sufficient, then additional training documents need to
-    be annotated or additional rules have to be handcrafted to provide more features in general or
-    more appropriate features. Rules or rule sets of high quality can be modified, combined or
-    generalized and transfered to the rule set of the application in order to support the extraction
-    task of the target concept. In the case that the methods did not learn reasonable rules at all,
-    the knowledge engineer proceeds with writing handcrafted rules.
-  </para>
-  <para> Having gathered enough extraction knowledge for the current concept, the semi-automatic
-    process is iterated and the focus is moved to the next concept until the development of the
-    application is completed.
-  </para>
-  <section id="ugr.tools.ruta.textruler.learner">
-    <title>Available Learners</title>
-    <para>
-      The available learners are based on the following publications:
-      <orderedlist numeration="arabic">
-      <!-- 
-        <listitem>
-          <para> Dayne Freitag and Nicholas Kushmerick. Boosted Wrapper Induction. In AAAI/IAAI,
-            pages 577-583, 2000.</para>
-        </listitem>
-       -->
-        <listitem>
-          <para> F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic
-            Constraints. Technical Report CS-03-07, Department of Computer Science, University of
-            Sheffield, Sheffield, 2003.</para>
-        </listitem>
-        <listitem>
-          <para> Mary Elaine Califf and Raymond J. Mooney. Bottom-up Relational Learning of Pattern
-            Matching Rules for Information Extraction. Journal of Machine Learning Research,
-            4:177-210, 2003.</para>
-        </listitem>
-        <listitem>
-          <para> Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information
-            Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34,
-            pages 233-272, 1999.</para>
-        </listitem>
-        <listitem>
-          <para> N. Kushmerick, D. Weld, and B. Doorenbos. Wrapper Induction for Information
-            Extraction. In Proc. IJC Artificial Intelligence, 1997.</para>
-        </listitem>
-      </orderedlist>
-    </para>
+   <section id="section.tools.ruta.workbench.textruler.learner">
+    <title>Included rule learning algorithms</title>
     <para>
-      Each available learner has several features. Their meaning is explained here:
-      <itemizedlist>
-        <listitem>
-          <para> Strategy: The used strategy of the learning methods are commonly coverage
-            algorithms.</para>
-        </listitem>
-        <listitem>
-          <para>
-            Document: The type of the document may be <quote>free</quote>
-            like in newspapers, <quote>semi</quote>
-            or <quote>struct</quote> like in HTML pages.
-   </para>
-        </listitem>
-        <listitem>
-          <para> Slots: The slots refer to a single annotation that represents the goal of the
-            learning task. Some rule are able to create several annotations at once in the same
-            context (multi-slot). However, only single slots are supported by the current
-            implementations.</para>
-        </listitem>
-        <listitem>
-          <para> Status: The current status of the implementation in the TextRuler framework.</para>
-        </listitem>
-      </itemizedlist>
-    </para>
-    <para>
-      The following table gives an overview:
-      <table id="table.ugr.tools.ruta.workbench.textruler.available_learners" frame="all">
-        <title>Overview of available learners</title>
-        <tgroup cols="6" colsep="1" rowsep="1">
-          <colspec colname="c1" colwidth="1*" />
-          <colspec colname="c2" colwidth="1*" />
-          <colspec colname="c3" colwidth="1*" />
-          <colspec colname="c4" colwidth="1*" />
-          <colspec colname="c5" colwidth="1*" />
-          <colspec colname="c6" colwidth="1*" />
-          <thead>
-            <row>
-              <entry align="center">Name</entry>
-              <entry align="center">Strategy</entry>
-              <entry align="center">Document</entry>
-              <entry align="center">Slots</entry>
-              <entry align="center">Status</entry>
-              <entry align="center">Publication</entry>
-            </row>
-          </thead>
-          <tbody>
-          <!-- 
-            <row>
-              <entry>BWI</entry>
-              <entry>Boosting, Top Down</entry>
-              <entry>Struct, Semi</entry>
-              <entry>Single, Boundary</entry>
-              <entry>Planning</entry>
-              <entry>1</entry>
-            </row>
-           -->
-            <row>
-              <entry>LP2</entry>
-              <entry>Bottom Up Cover</entry>
-              <entry>All</entry>
-              <entry>Single, Boundary</entry>
-              <entry>Prototype</entry>
-              <entry>1</entry>
-            </row>
-            <row>
-              <entry>RAPIER</entry>
-              <entry>Top Down/Bottom Up Compr.</entry>
-              <entry>Semi</entry>
-              <entry>Single</entry>
-              <entry>Experimental</entry>
-              <entry>2</entry>
-            </row>
-            <row>
-              <entry>WHISK</entry>
-              <entry>Top Down Cover</entry>
-              <entry>All</entry>
-              <entry>Multi</entry>
-              <entry>Prototype</entry>
-              <entry>3</entry>
-            </row>
-            <row>
-              <entry>WIEN</entry>
-              <entry>CSP</entry>
-              <entry>Struct</entry>
-              <entry>Multi, Rows</entry>
-              <entry>Prototype</entry>
-              <entry>4</entry>
-            </row>
-          </tbody>
-        </tgroup>
-      </table>
-    </para>
-    <!-- 
-    <section id="section.ugr.tools.ruta.workbench.textruler.bwi">
-      <title>BWI (Boosted Wrapper Induction)</title>
-      <para> BWI uses boosting techniques to improve the performance of simple pattern matching
-        single-slot boundary wrappers (boundary detectors). Two sets of detectors are learned: the
-        "fore" and the "aft" detectors. Weighted by their confidences and combined with a slot
-        length histogram derived from the training data they can classify a given pair of boundaries
-        within a document. BWI can be used for structured, semi-structured and free text. The
-        patterns are token-based with special wildcards for more general rules.   </para>
-      <para> Implementations No implementations are yet available.   </para>
-      <para> Parameters No parameters are yet available.   </para>
-    </section>
-     -->
-    <section id="section.ugr.tools.ruta.workbench.textruler.lp2">
+      This section gives a short description of the rule learning algorithms
+      that are provided in the UIMA Ruta TextRuler framework.
+      </para>
+      <section id="section.tools.ruta.workbench.textruler.lp2">
       <title>LP2</title>
-      <para>This method operates on all three kinds of documents. It learns separate rules for
-        the beginning and the end of a single slot. Tagging rules insert boundary SGML
-        tags and, additionally, induced correction rules shift misplaced tags to their correct
-        positions in order to improve precision. The learning strategy is a bottom-up covering
+      <note>
+      <para>
+        This rule learner is an experimental implementation of the ideas and algorithms published in:
+        F. Ciravegna. (LP)2, Rule Induction for Information Extraction Using Linguistic
+        Constraints. Technical Report CS-03-07, Department of Computer Science, University of
+        Sheffield, Sheffield, 2003.
+      </para>
+      </note>
+      <para>This algorithm learns separate rules for
+        the beginning and the end of a single slot, which are later combined 
+        in order to identify the targeted annotation. The learning strategy is a bottom-up covering
         algorithm. It starts by creating a specific seed instance with a window of w tokens to the
-        left and right of the target boundary and searches for the best generalization. Other
-        linguistic NLP-features can be used in order to generalize over the flat word sequence.
+        left and right of the target boundary and searches for the best generalization. Additional context rules are
+        induced in order to identify missing boundaries. The current implementation does not support correction rules.
+        The TextRuler framework provides two versions of this algorithm: LP2 (naive) is a straightforward implementation
+        with limited expressiveness concerning the resulting Ruta rules. LP2 (optimized) is an improved
+        version that uses a dynamic programming approach and generally provides better results.
+        The following parameters are available. For a more detailed description of the parameters, 
+        please refer to the implementation and the publication.
       </para>
       <para>
-        Parameters: 
-      </para>
       <itemizedlist>
         <listitem>
           <para>Context Window Size (to the left and right)</para>
         </listitem>
         <listitem>
-          <para>Best Rules List Size: Minimum</para>
+          <para>Best Rules List Size</para>
         </listitem>
         <listitem>
-          <para>Covered Positives per Rule</para>
+          <para>Minimum Covered Positives per Rule</para>
         </listitem>
         <listitem>
           <para>Maximum Error Threshold</para>
@@ -225,55 +86,28 @@ under the License.
           <para>Contextual Rules List Size</para>
         </listitem>
       </itemizedlist>
-    </section>
-    <section id="section.ugr.tools.ruta.workbench.textruler.rapier">
-      <title>RAPIER</title>
-      <para>RAPIER induces single slot extraction rules for semi-structured documents. The rules
-        consist of three patterns: a pre-filler, a filler and a post-filler pattern. Each pattern can hold
-        several constraints on tokens and their according POS-tag- and semantic information. The
-        algorithm uses a bottom-up compression strategy starting with a most specific seed rule for
-        each training instance. This initial rule base is compressed by randomly selecting rule
-        pairs and search for the best generalization. Considering two rules, the least general
-        generalization (LGG) of the slot fillers are created and specialized by adding rule items to
-        the pre- and post-filler until the new rules operate well on the training set. The best of
-        the k rules (k-beam search) is added to the rule base and all empirically subsumed rules are
-        removed.   </para>
-      <para>
-        Parameters: 
       </para>
-      <itemizedlist>
-        <listitem>
-          <para>Parameters Maximum Compression Fail Count</para>
-        </listitem>
-        <listitem>
-          <para>Internal Rules List Size: Rule Pairs for Generalizing</para>
-        </listitem>
-        <listitem>
-          <para>Maximum 'No improvement' Count</para>
-        </listitem>
-        <listitem>
-          <para>Maximum Noise Threshold: Minimum Covered Positives Per Rule</para>
-        </listitem>
-        <listitem>
-          <para>PosTag Root Type</para>
-        </listitem>
-        <listitem>
-          <para>Use All 3 GenSets at Specialization</para>
-        </listitem>
-      </itemizedlist>
     </section>
-    <section id="section.ugr.tools.ruta.workbench.textruler.whisk">
+    
+    <section id="section.tools.ruta.workbench.textruler.whisk">
       <title>WHISK</title>
-      <para> WHISK is a multi-slot method that operates on all three kinds of documents and learns
-        single- or multi-slot rules looking similar to regular expressions. The top-down covering
-        algorithm begins with the most general rule and specializes it by adding single rule terms
-        until the rule does not make errors anymore on the training set. Domain specific classes or linguistic
-        information obtained by a syntactic analyzer can be used as additional features. The exact
-        definition of a rule term (e.g., a token) and of a problem instance (e.g., a whole document or
-        a single sentence) depends on the operating domain and document type.   </para>
+      <note>
       <para>
-        Parameters: 
+        This rule learner is an experimental implementation of the ideas and algorithms published in:
+        Stephen Soderland, Claire Cardie, and Raymond Mooney. Learning Information
+        Extraction Rules for Semi-Structured and Free Text. In Machine Learning, volume 34,
+        pages 233-272, 1999.
       </para>
+      </note>
+      <para>WHISK is a multi-slot method that operates on all three kinds of documents and learns
+        single- or multi-slot rules that look similar to regular expressions. However, the current implementation only supports single-slot rules.
+        The top-down covering algorithm begins with the most general rule and specializes it by adding single rule terms
+        until the rule no longer makes errors on the training set. The TextRuler framework provides two versions of this algorithm:
+        WHISK (token) is a naive token-based implementation. WHISK (generic) is an optimized and improved implementation, 
+        which is able to refer to arbitrary annotations and also supports primitive features. The following parameters are available. For a more detailed description of the parameters, 
+        please refer to the implementation and the publication.
+        </para>
+      <para>
       <itemizedlist>
         <listitem>
           <para>Parameters Window Size</para>
@@ -284,17 +118,51 @@ under the License.
         <listitem>
           <para>PosTag Root Type</para>
         </listitem>
+        <listitem>
+          <para>Considered Features (comma-separated) - only WHISK (generic)</para>
+        </listitem>
       </itemizedlist>
-    </section>
-    <section id="section.ugr.tools.ruta.workbench.textruler.wien">
-      <title>WIEN </title>
-      <para> WIEN is the only method listed here that operates on highly structured texts only. It
-        induces wrappers that anchor the slots by their structured context.
-        The HLRT (head left right tail) wrapper class for example can determine and extract several
-        multi-slot-templates by first separating the important information block from unimportant
-        head and tail portions and extracting multiple data rows from table like data
-        structures from the remaining document. Inducing a wrapper is done by solving a CSP for all
-        possible pattern combinations from the training data.   </para>
-    </section>
-  </section>
-</section>
\ No newline at end of file
+      </para>
+    </section>  
+  </section>  
+   <section id="section.tools.ruta.workbench.textruler.ui">
+   <title>The TextRuler view</title>
+      <para> 
+      </para>
+      <figure id="figure.tools.ruta.workbench.textruler.main">
+      <title>The UIMA Ruta TextRuler framework
+      </title>
+      <mediaobject>
+        <imageobject role="html">
+          <imagedata width="776px" format="PNG" align="center"
+            fileref="&imgroot;textruler/textruler.png" />
+        </imageobject>
+        <imageobject role="fo">
+          <imagedata width="5.4in" format="PNG" align="center"
+            fileref="&imgroot;textruler/textruler.png" />
+        </imageobject>
+        <textobject>
+          <phrase>UIMA Ruta TextRuler framework</phrase>
+        </textobject>
+      </mediaobject>
+    </figure>
+    <figure id="figure.tools.ruta.workbench.textruler.pref">
+      <title>The UIMA Ruta TextRuler Preferences
+      </title>
+      <mediaobject>
+        <imageobject role="html">
+          <imagedata width="576px" format="PNG" align="center"
+            fileref="&imgroot;textruler/textruler_pref.png" />
+        </imageobject>
+        <imageobject role="fo">
+          <imagedata width="3.3in" format="PNG" align="center"
+            fileref="&imgroot;textruler/textruler_pref.png" />
+        </imageobject>
+        <textobject>
+          <phrase>UIMA Ruta TextRuler Preferences</phrase>
+        </textobject>
+      </mediaobject>
+    </figure>
+    
+   </section>
+</section>
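
Illustration for the LP2 description in the diff above: the text says that
learning starts from a seed instance with a window of w tokens to the left
and right of the target boundary. The sketch below shows only that first
step, under stated assumptions. It is not taken from the TextRuler sources;
the function name seed_window, its parameters, and the sample sentence are
hypothetical and exist only for this example.

```python
from typing import List, Tuple


def seed_window(tokens: List[str], boundary: int, w: int) -> Tuple[List[str], List[str]]:
    """Return up to w tokens to the left and right of a slot boundary.

    `boundary` is the index of the first token to the right of the boundary,
    so valid values are 0 <= boundary <= len(tokens). (Hypothetical helper,
    not part of TextRuler.)
    """
    if w < 0 or not 0 <= boundary <= len(tokens):
        raise ValueError("invalid window size or boundary position")
    left = tokens[max(0, boundary - w):boundary]   # context before the boundary
    right = tokens[boundary:boundary + w]          # context after the boundary
    return left, right


# Assumed sample data: the boundary sits before "3", the start of a time slot.
tokens = "The seminar starts at 3 pm in room 101".split()
left, right = seed_window(tokens, boundary=4, w=2)
print(left, right)
```

A bottom-up learner would treat such a window as its most specific rule and
then relax individual token constraints to generalize it. That generalization
step is not shown here.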