Posted to commits@uima.apache.org by pk...@apache.org on 2012/11/30 13:51:26 UTC
svn commit: r1415605 - in /uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook: ./ images/tools/tm/language/ images/tools/tm/language/basic_token/ language/

Author: pkluegl
Date: Fri Nov 30 12:51:25 2012
New Revision: 1415605

URL: http://svn.apache.org/viewvc?rev=1415605&view=rev
Log:
UIMA-2285
- added sections about seed annotations
- fixed layout in syntax section
- fixed some typos in overview section
- removed old/wrong inference section
- improved remaining sections in language chapter

Added:
    uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tm/language/
    uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tm/language/basic_token/
    uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tm/language/basic_token/basic_token.png   (with props)
    uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.basic_annotations.xml
Modified:
    uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.syntax.xml
    uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.language.xml
    uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml

Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tm/language/basic_token/basic_token.png
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tm/language/basic_token/basic_token.png?rev=1415605&view=auto
==============================================================================
Binary file - no diff available.

Propchange: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tm/language/basic_token/basic_token.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.basic_annotations.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.basic_annotations.xml?rev=1415605&view=auto
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.basic_annotations.xml (added)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.basic_annotations.xml Fri Nov 30 12:51:25 2012
@@ -0,0 +1,239 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE section PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
+<!ENTITY imgroot "images/tools/tm/language/" >
+<!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >  
+%uimaents;
+]>
+<!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor 
+	license agreements. See the NOTICE file distributed with this work for additional 
+	information regarding copyright ownership. The ASF licenses this file to 
+	you under the Apache License, Version 2.0 (the "License"); you may not use 
+	this file except in compliance with the License. You may obtain a copy of 
+	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required 
+	by applicable law or agreed to in writing, software distributed under the 
+	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS 
+	OF ANY KIND, either express or implied. See the License for the specific 
+	language governing permissions and limitations under the License. -->
+
+<section id="ugr.tools.tm.language.seeding">
+	<title>Basic annotations and tokens</title>
+	<para>
+		The TextMarker system uses a JFlex lexer to initially create a
+		seed of
+		basic token annotations. These tokens build a hierarchy
+		which is shown
+		in
+		<xref linkend='figure.ugr.tools.tm.language.seeding.basic_token' />
+		. The
+		<quote>ALL</quote>
+		(green) annotation is the root of the hierarchy. ALL and the red
+		marked annotation types are abstract. This means that they are not
+		actually
+		created by the lexer. An overview of these abstract types can
+		be found in
+		<xref linkend='table.ugr.tools.tm.language.seeding.basic_token.abstract' />
+		. The leaves of the hierarchy (blue) are created by the lexer. Each
+		leaf is
+		its own type but also inherits from the abstract
+		annotation types further up in the hierarchy. The leaf types are
+		described in
+		more detail in
+		<xref linkend='table.ugr.tools.tm.language.seeding.basic_token.created' />
+		. Each text unit within an input
+		document belongs to exactly one of these
+		annotation types.
+	</para>
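+	<para>
+		As a minimal sketch of how this inheritance can be used (the type
+		name
+		<literal>Word</literal>
+		is a hypothetical name chosen for this illustration), a rule element
+		that refers to the abstract type W should match on every SW, CW and
+		CAP token, because each of these leaf types inherits from W:
+		<programlisting><![CDATA[DECLARE Word;
+// W itself is never created by the lexer,
+// but SW, CW and CAP tokens inherit from it
+W{-> MARK(Word)};
+]]></programlisting>
+	</para>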
+	<para>
+		<figure id="figure.ugr.tools.tm.language.seeding.basic_token">
+			<title>Basic token hierarchy
+			</title>
+			<mediaobject>
+				<imageobject role="html">
+					<imagedata width="576px" format="PNG" align="center"
+						fileref="&imgroot;basic_token/basic_token.png" />
+				</imageobject>
+				<imageobject role="fo">
+					<imagedata width="5.5in" format="PNG" align="center"
+						fileref="&imgroot;basic_token/basic_token.png" />
+				</imageobject>
+				<textobject>
+					<phrase>
+						Basic token hierarchy.
+					</phrase>
+				</textobject>
+			</mediaobject>
+		</figure>
+	</para>
+	<para>
+		<table id="table.ugr.tools.tm.language.seeding.basic_token.abstract"
+			frame="all">
+			<title>Abstract annotations</title>
+			<tgroup cols="3" colsep="1" rowsep="1">
+				<colspec colname="c1" colwidth="1*" />
+				<colspec colname="c2" colwidth="1*" />
+				<colspec colname="c3" colwidth="3*" />
+				<thead>
+					<row>
+						<entry align="center">Annotation</entry>
+						<entry align="center">Parent</entry>
+						<entry align="center">Description</entry>
+					</row>
+				</thead>
+				<tbody>
+					<row>
+						<entry>ALL</entry>
+						<entry>-</entry>
+						<entry>parent type of all tokens</entry>
+					</row>
+					<row>
+						<entry>ANY</entry>
+						<entry>ALL</entry>
+						<entry>all tokens except markup</entry>
+					</row>
+					<row>
+						<entry>W</entry>
+						<entry>ANY</entry>
+						<entry>all kinds of words</entry>
+					</row>
+					<row>
+						<entry>PM</entry>
+						<entry>ANY</entry>
+						<entry>all kinds of punctuation marks</entry>
+					</row>
+					<row>
+						<entry>WS</entry>
+						<entry>ANY</entry>
+						<entry>all kinds of white spaces</entry>
+					</row>
+					<row>
+						<entry>SENTENCEEND</entry>
+						<entry>PM</entry>
+						<entry>all kinds of punctuation marks that indicate the end of a
+							sentence
+						</entry>
+					</row>
+				</tbody>
+			</tgroup>
+		</table>
+	</para>
+	<para>
+		<table id="table.ugr.tools.tm.language.seeding.basic_token.created"
+			frame="all">
+			<title>Annotations created by lexer</title>
+			<tgroup cols="4" colsep="1" rowsep="1">
+				<colspec colname="c1" colwidth="1*" />
+				<colspec colname="c2" colwidth="1*" />
+				<colspec colname="c3" colwidth="1*" />
+				<colspec colname="c4" colwidth="1*" />
+
+				<thead>
+					<row>
+						<entry align="center">Annotation</entry>
+						<entry align="center">Parent</entry>
+						<entry align="center">Description</entry>
+						<entry align="center">Example</entry>
+					</row>
+				</thead>
+				<tbody>
+					<row>
+						<entry>MARKUP</entry>
+						<entry>ALL</entry>
+						<entry>HTML and XML elements</entry>
+						<entry><![CDATA[<p class="Headline">]]></entry>
+					</row>
+					<row>
+						<entry>NBSP</entry>
+						<entry>ANY</entry>
+						<entry>non-breaking space</entry>
+						<entry><![CDATA[&nbsp;]]></entry>
+					</row>
+					<row>
+						<entry>AMP</entry>
+						<entry>ANY</entry>
+						<entry>ampersand expression</entry>
+						<entry><![CDATA[&auml;]]></entry>
+					</row>
+					<row>
+						<entry>BREAK</entry>
+						<entry>WS</entry>
+						<entry>line break</entry>
+						<entry><![CDATA[\n]]></entry>
+					</row>
+					<row>
+						<entry>SPACE</entry>
+						<entry>WS</entry>
+						<entry>spaces</entry>
+						<entry><![CDATA[" "]]></entry>
+					</row>
+					<row>
+						<entry>COLON</entry>
+						<entry>PM</entry>
+						<entry>colon</entry>
+						<entry><![CDATA[:]]></entry>
+					</row>
+					<row>
+						<entry>COMMA</entry>
+						<entry>PM</entry>
+						<entry>comma</entry>
+						<entry><![CDATA[,]]></entry>
+					</row>
+					<row>
+						<entry>PERIOD</entry>
+						<entry>SENTENCEEND</entry>
+						<entry>period</entry>
+						<entry><![CDATA[.]]></entry>
+					</row>
+					<row>
+						<entry>EXCLAMATION</entry>
+						<entry>SENTENCEEND</entry>
+						<entry>exclamation mark</entry>
+						<entry><![CDATA[!]]></entry>
+					</row>
+					<row>
+						<entry>SEMICOLON</entry>
+						<entry>PM</entry>
+						<entry>semicolon</entry>
+						<entry><![CDATA[;]]></entry>
+					</row>
+					<row>
+						<entry>QUESTION</entry>
+						<entry>SENTENCEEND</entry>
+						<entry>question mark</entry>
+						<entry><![CDATA[?]]></entry>
+					</row>
+					<row>
+						<entry>SW</entry>
+						<entry>W</entry>
+						<entry>lower case word</entry>
+						<entry><![CDATA[annotation]]></entry>
+					</row>
+					<row>
+						<entry>CW</entry>
+						<entry>W</entry>
+						<entry>word starting with one capitalized letter</entry>
+						<entry><![CDATA[Annotation]]></entry>
+					</row>
+					<row>
+						<entry>CAP</entry>
+						<entry>W</entry>
+						<entry>word only containing capitalized letters</entry>
+						<entry><![CDATA[ANNOTATION]]></entry>
+					</row>
+					<row>
+						<entry>NUM</entry>
+						<entry>ANY</entry>
+						<entry>sequence of digits</entry>
+						<entry><![CDATA[0123]]></entry>
+					</row>
+					<row>
+						<entry>SPECIAL</entry>
+						<entry>ANY</entry>
+						<entry>all other tokens and symbols</entry>
+						<entry><![CDATA[/]]></entry>
+					</row>
+				</tbody>
+			</tgroup>
+		</table>
+	</para>
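+	<para>
+		The following sketch illustrates the two tables with a made-up input
+		line. The token sequence is derived by hand from the descriptions
+		above and is not actual output of the lexer:
+		<programlisting><![CDATA[Input:   Dr. Smith paid 100 EUR!
+
+Dr      CW           (parents: W, ANY, ALL)
+.       PERIOD       (parents: SENTENCEEND, PM, ANY, ALL)
+" "     SPACE        (parents: WS, ANY, ALL)
+Smith   CW
+" "     SPACE
+paid    SW           (parents: W, ANY, ALL)
+" "     SPACE
+100     NUM          (parents: ANY, ALL)
+" "     SPACE
+EUR     CAP          (parents: W, ANY, ALL)
+!       EXCLAMATION  (parents: SENTENCEEND, PM, ANY, ALL)
+]]></programlisting>
+	</para>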
+</section>
\ No newline at end of file

Modified: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.syntax.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.syntax.xml?rev=1415605&r1=1415604&r2=1415605&view=diff
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.syntax.xml (original)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/language/tools.textmarker.language.syntax.xml Fri Nov 30 12:51:25 2012
@@ -26,8 +26,7 @@
 		Structure: The overall structure of a TextMarker
 		script is defined by
 		the following syntax.
-		<programlisting><![CDATA[
-Script              -> PackageDeclaration GlobalStatements Statements
+		<programlisting><![CDATA[Script              -> PackageDeclaration GlobalStatements Statements
 PackageDeclaration  -> "PACKAGE" DottedIdentifier ";"
 GlobalStatements    -> GlobalStatement*   
 GlobalStatement     -> ("TYPESYSTEM" | "SCRIPT" | "ENGINE") 
@@ -41,8 +40,7 @@ Statement           -> Declaration | Var
 	</para>
 	<para>
 		Example beginning of a TextMarker file:
-		<programlisting><![CDATA[
-PACKAGE de.uniwue.example;
+		<programlisting><![CDATA[PACKAGE de.uniwue.example;
 
 // import the types of this type system 
 // (located in the descriptor folder -> de.uniwue.example folder)
@@ -54,8 +52,7 @@ SCRIPT de.uniwue.example.Year;
 ]]></programlisting>
 
 		Syntax of declarations:
-		<programlisting><![CDATA[
-Declaration  ->  "DECLARE" (AnnotationType)? Identifier ("," Identifier )*
+		<programlisting><![CDATA[Declaration  ->  "DECLARE" (AnnotationType)? Identifier ("," Identifier )*
                  | "DECLARE" AnnotationType Identifier ( "(" 
                  FeatureDeclaration ")" )?
 FeatureDeclaration  -> ( (AnnotationType | "STRING" | "INT" | "FLOAT"
@@ -83,19 +80,20 @@ BasicAnnotationType ->  ('COLON'| 'SW' |
                         | 'EXCLAMATION' | 'SEMICOLON' | 'NBSP'| 'AMP' | '_' 
                         | 'SENTENCEEND' | 'W' | 'PM' | 'ANY' | 'ALL' 
                         | 'SPACE' | 'BREAK') 
-BlockDeclaration       -> "BLOCK" "(" Identifier ")" RuleElementType 
+BlockDeclaration       -> "BLOCK" "(" Identifier ")" RuleElementWithCA 
                                                          "{" Statements "}"
-AutomataDeclaration    -> "RULES" "(" Identifier ")" RuleElementType 
+AutomataDeclaration    -> "RULES" "(" Identifier ")" RuleElementWithCA 
                                                          "{" Statements "}"
 ]]></programlisting>
 
 		Syntax of statements and rule elements
-		<programlisting><![CDATA[
-SimpleStatement        -> RuleElements ";"
+		<programlisting><![CDATA[SimpleStatement        -> RuleElements ";"
 RuleElements           -> RuleElement+
 RuleElement            -> RuleElementType | RuleElementLiteral 
                           | RuleElementComposed | RuleElementDisjunctive
 RuleElementType        ->  TypeExpression QuantifierPart? 
+                                           ("{" Conditions?  Actions? "}")?
+RuleElementWithCA      ->  TypeExpression QuantifierPart? 
                                               "{" Conditions?  Actions? "}"
 RuleElementLiteral     ->  SimpleStringExpression QuantifierPart? 
                                               "{" Conditions?  Actions? "}"
@@ -120,8 +118,7 @@ Actions                -> "->" Action ( 
 	</para>
 	<para>
 		Identifier
-		<programlisting><![CDATA[
-DottedIdentifier    ->  Identifier ("." Identifier)*
+		<programlisting><![CDATA[DottedIdentifier    ->  Identifier ("." Identifier)*
 DottedIdentifier2   ->  Identifier (("."|"-") Identifier)*
 Identifier          ->  letter (letter|digit)*
 ]]></programlisting>

Modified: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.language.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.language.xml?rev=1415605&r1=1415604&r2=1415605&view=diff
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.language.xml (original)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.language.xml Fri Nov 30 12:51:25 2012
@@ -1,239 +1,194 @@
 <?xml version="1.0" encoding="UTF-8"?>
 <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
 "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd"[
-<!ENTITY imgroot "images/tools/tools.textmarker/" >
+<!ENTITY imgroot "images/tools/tm/language/" >
 <!ENTITY % uimaents SYSTEM "../../target/docbook-shared/entities.ent" >  
 %uimaents;
 ]>
 <!-- Licensed to the Apache Software Foundation (ASF) under one or more contributor 
-	license agreements. See the NOTICE file distributed with this work for additional 
-	information regarding copyright ownership. The ASF licenses this file to 
-	you under the Apache License, Version 2.0 (the "License"); you may not use 
-	this file except in compliance with the License. You may obtain a copy of 
-	the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required 
-	by applicable law or agreed to in writing, software distributed under the 
-	License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS 
-	OF ANY KIND, either express or implied. See the License for the specific 
-	language governing permissions and limitations under the License. -->
+  license agreements. See the NOTICE file distributed with this work for additional 
+  information regarding copyright ownership. The ASF licenses this file to 
+  you under the Apache License, Version 2.0 (the "License"); you may not use 
+  this file except in compliance with the License. You may obtain a copy of 
+  the License at http://www.apache.org/licenses/LICENSE-2.0 Unless required 
+  by applicable law or agreed to in writing, software distributed under the 
+  License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS 
+  OF ANY KIND, either express or implied. See the License for the specific 
+  language governing permissions and limitations under the License. -->
 
 <chapter id="ugr.tools.tm.language.language">
-	<title>TextMarker Language</title>
-	<para>
-		This chapter provides a complete description of the TextMarker
-		language.
-	</para>
-
-	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
-		href=".\language\tools.textmarker.language.syntax.xml" />
-
-	<section id="ugr.tools.tm.language.inference">
-		<title>Inference</title>
-		<para>
-			The inference relies on a complete, disjunctive partition of the
-			document. A basic (minimal) annotation for each element of the
-			partition is assigned to a type of a hierarchy. These basic
-			annotations are enriched for performance reasons with information
-			about annotations that start at the same offset or overlap with the
-			basic annotation. Normally, a scanner creates a basic annotation for
-			each token, punctuation or whitespace, but can also be replaced with
-			a different annotation seeding strategy. Unlike other rule-based
-			information extraction language, the rules are executed in an
-			imperative way. Experience has shown that the dependencies between
-			rules, e.g., the same annotation types in the action and in the
-			condition of a different rule, often form tree-like and not
-			graph-like structures. Therefore, the sequencing and imperative
-			processing did not cause disadvantages, but instead obvious
-			advantages, e.g., the improved understandability of large rule sets.
-			The following algorithm summarizes the rule inference:
-			<programlisting><![CDATA[
-collect all basic annotations that fulfill the first matching condition
-  for all collected basic annotations do
-    for all rule elements of current rule do
-    if quantifier wants to match then
-      match the conditions of the rule element on the current basic annotation
-      determine the next basic annotation after the current match
-      if quantifier wants to continue then
-        if there is a next basic annotation then
-          continue with the current rule element and the next basic annotation
-        else if rule element did not match then
-          reset the next basic annotation to the current one
-      set the current basic annotation to the next one
-      if some rule elements did not match then
-        stop and continue with the next collected basic annotation
-      else if there is no current basic annotation and the quantifier wants to continue then
-        set the current basic annotation to the previous one
-  if all rule elements matched then
-    execute the actions of all rule elements
-]]></programlisting>
-			The rule elements can of course match on all kinds of annotations.
-			Therefore the determination of the next basic annotation returns the
-			first basic annotation after the last basic annotation of the
-			complete, matched annotation.
-
-		</para>
-	</section>
-
-
-	<section id="ugr.tools.tm.language.seeding">
-		<title>Basic annotations and tokens</title>
-		<para>
-			The TextMarker system uses a JFlex lexer to initially create a
-			seed of
-			basic token annotations. These tokens build a hierarchy
-			which is shown
-			in
-			<xref linkend='figure.ugr.tools.tm.language.seeding.basic_token' />
-			.
-		</para>
-		<para>
-			<figure id="figure.ugr.tools.tm.language.seeding.basic_token">
-				<title>Basic token hierarchy
-				</title>
-				<mediaobject>
-					<imageobject role="html">
-						<imagedata width="576px" format="PNG" align="center"
-							fileref="&imgroot;overview/screenshot_tm_perspective_.png" />
-					</imageobject>
-					<imageobject role="fo">
-						<imagedata width="5.5in" format="PNG" align="center"
-							fileref="&imgroot;overview/screenshot_tm_perspective_.png" />
-					</imageobject>
-					<textobject>
-						<phrase>
-							Basic token hierarchy.
-						</phrase>
-					</textobject>
-				</mediaobject>
-			</figure>
-		</para>
-		<para>
-			<table id="table.ugr.tools.tm.language.seeding.basic_token.abstract"
-				frame="all">
-				<title>Abstract annotations</title>
-				<tgroup cols="5" colsep="1" rowsep="1">
-					<colspec colname="c1" colwidth="1*" />
-					<colspec colname="c2" colwidth="1*" />
-					<colspec colname="c3" colwidth="1*" />
-					<thead>
-						<row>
-							<entry align="center">Token</entry>
-							<entry align="center">Parent</entry>
-							<entry align="center">Description</entry>
-						</row>
-					</thead>
-					<tbody>
-						<row>
-							<entry></entry>
-							<entry></entry>
-							<entry></entry>
-						</row>
-					</tbody>
-				</tgroup>
-			</table>
-		</para>
-		<para>
-			<table id="table.ugr.tools.tm.language.seeding.basic_token.created"
-				frame="all">
-				<title>Annotations created by lexer</title>
-				<tgroup cols="5" colsep="1" rowsep="1">
-					<colspec colname="c1" colwidth="1*" />
-					<colspec colname="c2" colwidth="1*" />
-					<colspec colname="c3" colwidth="1*" />
-					<colspec colname="c4" colwidth="1*" />
-					<colspec colname="c5" colwidth="1*" />
-					<thead>
-						<row>
-							<entry align="center">Token</entry>
-							<entry align="center">Parent</entry>
-							<entry align="center">Description</entry>
-							<entry align="center">JFlex expression</entry>
-							<entry align="center">Example</entry>
-						</row>
-					</thead>
-					<tbody>
-						<row>
-							<entry></entry>
-							<entry></entry>
-							<entry></entry>
-							<entry></entry>
-							<entry></entry>
-						</row>
-					</tbody>
-				</tgroup>
-			</table>
-		</para>
-	</section>
-
-	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
-		href=".\language\tools.textmarker.language.quantifier.xml" />
-	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
-		href=".\language\tools.textmarker.language.declarations.xml" />
-	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
-		href=".\language\tools.textmarker.language.expressions.xml" />
-	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
-		href=".\language\tools.textmarker.language.conditions.xml" />
-	<xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
-		href=".\language\tools.textmarker.language.actions.xml" />
-
-
-	<section id="ugr.tools.tm.language.filtering">
-		<title>Robust extraction using filtering</title>
-		<para>
-			Rule based or pattern based information extraction systems often
-			suffer from unimportant fill words, additional whitespace and
-			unexpected markup. The TextMarker System enables the knowledge
-			engineer to filter and to hide all possible combinations of
-			predefined and new types of annotations. Additionally, it can
-			differentiate between every kind of HTML markup and XML tags. The
-			visibility of tokens and annotations is modified by the actions of
-			rule elements and can be conditioned using the complete
-			expressiveness of the language. Therefore the TextMarker system
-			supports a robust approach to information extraction and simplifies
-			the creation of new rules since the knowledge engineer can focus on
-			important textual features. If no rule action changed the
-			configuration of the filtering settings, then the default filtering
-			configuration ignores whitespaces and markup. Using the default
-			setting, the following rule matches all four types of input in this
-			example:
-			<programlisting><![CDATA[
+  <title>TextMarker Language</title>
+  <para>
+    This chapter provides a complete description of the TextMarker
+    language.
+  </para>
+
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
+    href=".\language\tools.textmarker.language.syntax.xml" />
+
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
+    href=".\language\tools.textmarker.language.basic_annotations.xml" />
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
+    href=".\language\tools.textmarker.language.quantifier.xml" />
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
+    href=".\language\tools.textmarker.language.declarations.xml" />
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
+    href=".\language\tools.textmarker.language.expressions.xml" />
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
+    href=".\language\tools.textmarker.language.conditions.xml" />
+  <xi:include xmlns:xi="http://www.w3.org/2001/XInclude"
+    href=".\language\tools.textmarker.language.actions.xml" />
+
+
+  <section id="ugr.tools.tm.language.filtering">
+    <title>Robust extraction using filtering</title>
+    <para>
+      Rule-based or pattern-based information extraction systems often
+      suffer from unimportant fill words, additional whitespace and
+      unexpected markup. The TextMarker System enables the knowledge
+      engineer to filter and to hide all possible combinations of
+      predefined and new types of annotations. The
+      visibility of tokens and
+      annotations is modified by the actions of
+      rule elements and can be
+      conditioned using the complete
+      expressiveness of the language.
+      Therefore the TextMarker system
+      supports a robust approach to
+      information extraction and simplifies
+      the creation of new rules since
+      the knowledge engineer can focus on
+      important textual features. If no
+      rule action changed the
+      configuration of the filtering settings, then
+      the default filtering
+      configuration ignores whitespaces and markup.
+      Look at the following rule:
+      <programlisting><![CDATA["Dr" PERIOD CW CW;
+]]></programlisting>
+      Using the default
+      setting, this rule matches on all four lines
+      of this
+      input document:
+      <programlisting><![CDATA[Dr. Joachim Baumeister
+Dr . Joachim      Baumeister
+Dr. <b><i>Joachim</i> Baumeister</b>
+Dr.JoachimBaumeister
+]]></programlisting>
+    </para>
+    <para>
+      To change the default setting, use the
+      <quote>FILTERTYPE</quote>
+      or
+      <quote>RETAINTYPE</quote>
+      action. For example, if markup should no longer be ignored, try the
+      following example on the above input document:
+      <programlisting><![CDATA[Document{->RETAINTYPE(MARKUP)};
 "Dr" PERIOD CW CW;
 ]]></programlisting>
-			<programlisting><![CDATA[
-Dr. Peter Steinmetz
-Dr . Peter      Steinmetz
-Dr. <b><i>Peter</i> Steinmetz</b>
-Dr.PeterSteinmetz
-]]></programlisting>
-		</para>
-	</section>
-	<section id="ugr.tools.tm.language.blocks">
-		<title>Blocks</title>
-		<para>
-			Blocks combine some more complex control structures in the
-			TextMarker
-			language: conditioned statement, loops and procedures.
-
-
-			The
-			rule
-			element
-			in the definition of a block has to define a
-			condition/action
-			part,
-			even if that part is empty (LCURLY and
-			RCULRY).
-
-
-			A block can use
-			normal
-			conditions to condition the execution
-			of its
-			containing rules.
-
-			Examples:
-
-			<programlisting><![CDATA[
-DECLARE Month;
+      You will see that the third line of the previous input example is
+      no longer matched.
+    </para>
+    <para>
+      To filter types try the following on the input document:
+      <programlisting><![CDATA[Document{->FILTERTYPE(PERIOD)};
+"Dr" CW CW;
+]]></programlisting>
+      Since periods are ignored now, the rule will match on all four lines
+      of the example.
+    </para>
+    <para>
+      Notice that using a filtered annotation type within a
+      rule prevents
+      this rule from being executed. Try the following:
+      <programlisting><![CDATA[Document{->FILTERTYPE(PERIOD)};
+"Dr" PERIOD CW CW;
+]]></programlisting>
+      You will see that this matches on no line of the input document since
+      the second rule uses the filtered type PERIOD and is therefore not
+      executed.
+    </para>
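+    <para>
+      As a further sketch, assuming the four-line input document above,
+      whitespace can be made visible with the same
+      <quote>RETAINTYPE</quote>
+      action. The rule then has to name the SPACE tokens explicitly:
+      <programlisting><![CDATA[// retain SPACE tokens, which are ignored by default
+Document{->RETAINTYPE(SPACE)};
+"Dr" PERIOD SPACE CW SPACE CW;
+]]></programlisting>
+      Under this assumption the rule should still match the first line,
+      but no longer the second line (a space precedes the period) or the
+      fourth line (which contains no spaces at all).
+    </para>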
+  </section>
+  <section id="ugr.tools.tm.language.blocks">
+    <title>Blocks</title>
+    <para>
+      Blocks combine some more complex control structures in the
+      TextMarker
+      language:
+      <orderedlist numeration="arabic">
+        <listitem>
+          <para>
+            Conditioned statements
+          </para>
+        </listitem>
+        <listitem>
+          <para>
+            <quote>Foreach</quote>
+            -Loops
+          </para>
+        </listitem>
+        <listitem>
+          <para>
+            Procedures
+          </para>
+        </listitem>
+      </orderedlist>
+    </para>
+    <para>
+      Declaration of a block:
+      <programlisting><![CDATA[BlockDeclaration       -> "BLOCK" "(" Identifier ")" RuleElementWithCA
+                                                       "{" Statements "}"
+RuleElementWithCA      ->  TypeExpression QuantifierPart? 
+                                              "{" Conditions?  Actions? "}"
+]]></programlisting>
+      A block declaration always starts with the keyword
+      <quote>BLOCK</quote>
+      , followed by the identifier of the block in parentheses. The
+      <quote>RuleElementWithCA</quote>
+      element
+      is a TextMarker rule that consists of exactly one rule
+      element. The
+      rule element has to be a declared annotation type.
+      <note>
+        <para>
+          The
+          rule element in the definition of a block has to define
+          a
+          condition/action part, even if that part is empty (LCURLY and
+          RCURLY).
+        </para>
+      </note>
+    </para>
+    <para>
+      Through the rule element a new local document is defined, whose
+      scope
+      is the related block. So if you use
+      <literal>Document</literal>
+      within a block, this always refers to the locally limited document.
+      <programlisting><![CDATA[BLOCK(ForEach) Paragraph{} {
+    Document{COUNT(CW)}; // Here "Document" is limited to a Paragraph;
+               // therefore the rule only counts the CW annotations
+               // within the Paragraph
+}
+]]></programlisting>
+    </para>
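+    <para>
+      The difference between the global and the local document can be
+      sketched as follows. The variable names are chosen for this
+      illustration only, and Paragraph annotations are assumed to exist
+      already:
+      <programlisting><![CDATA[INT cwInDocument;
+INT cwInParagraph;
+
+// outside of a block: counts the CW annotations of the whole document
+Document{COUNT(CW, cwInDocument)};
+
+BLOCK(ForEach) Paragraph{} {
+    // inside the block: counts only the CW annotations of the
+    // current Paragraph
+    Document{COUNT(CW, cwInParagraph)};
+}
+]]></programlisting>
+    </para>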
+    <para>
+      A block is always executed when the TextMarker interpreter
+      reaches its
+      declaration. But a block may also be called from another
+      position of
+      the script. See
+      <xref linkend='ugr.tools.tm.language.blocks.procedure' />
+      .
+    </para>
+    <section id="ugr.tools.tm.language.blocks.condition">
+      <title>Conditioned statements</title>
+      <para>
+        A block can use common TextMarker conditions to condition the
+        execution of its containing rules.
+      </para>
+      <para>
+        Examples:
+        <programlisting><![CDATA[DECLARE Month;
 
 BLOCK(EnglishDates) Document{FEATURE("language", "en")} {
     Document{->MARKFAST(Month,'englishMonthNames.txt')};
@@ -245,113 +200,203 @@ BLOCK(GermanDates) Document{FEATURE("lan
     //...
 }
 ]]></programlisting>
+        The example is explained in detail in
+        <xref linkend='ugr.tools.tm.overview.examples' />
+        .
+      </para>
+    </section>
+    <section id="ugr.tools.tm.language.blocks.foreach">
+      <title>
+        <quote>Foreach</quote>
+        -Loops
+      </title>
+      <para>
+        A block can be used to execute the containing rules on a sequence of
+        similar text passages, therefore representing a
+        <quote>foreach</quote>
+        like loop.
+      </para>
+      <para>
+        Examples:
+        <programlisting><![CDATA[DECLARE SentenceWithNoLeadingNP;
+BLOCK(ForEach) Sentence{} {
+    Document{-STARTSWITH(NP) -> MARK(SentenceWithNoLeadingNP)};
+}
+]]></programlisting>
+        The example is explained in detail in
+        <xref linkend='ugr.tools.tm.overview.examples' />
+        .
+      </para>
+      <para>
+        This construction is especially useful if you have a set of rules
+        that has to be executed consecutively on the same part of an input
+        document. Let's assume you have already annotated your document with
+        Paragraph annotations. Now you want to count the number of words
+        within each paragraph and, if that number is greater than 500,
+        annotate the paragraph as BigParagraph. You might therefore write
+        the following rules:
+        <programlisting><![CDATA[DECLARE BigParagraph;
+INT numberOfWords;
+Paragraph{COUNT(W,numberOfWords)};
+Paragraph{IF(numberOfWords > 500) -> MARK(BigParagraph)};
+]]></programlisting>
+        This will not work. The reason is that the rule which counts the
+        number of words within a Paragraph is executed on all Paragraphs
+        before the last rule, which marks the Paragraph as BigParagraph, is
+        executed even once. Therefore, when the last rule is reached, the
+        variable
+        <literal>numberOfWords</literal>
+        holds the number of words of the last Paragraph in the input
+        document, so either all Paragraphs are annotated as BigParagraph
+        or none is.
+      </para>
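+      <para>
+        As an illustrative trace, assume a hypothetical input document with
+        two Paragraphs, the first containing 600 words and the second 100
+        words. These numbers are made up for this sketch. The comments show
+        how the two rules above would then be evaluated:
+        <programlisting><![CDATA[// Paragraph{COUNT(W,numberOfWords)};
+//   1st Paragraph: numberOfWords = 600
+//   2nd Paragraph: numberOfWords = 100 (overwrites the previous value)
+// Paragraph{IF(numberOfWords > 500) -> MARK(BigParagraph)};
+//   1st Paragraph: 100 > 500 fails, no annotation although it has 600 words
+//   2nd Paragraph: 100 > 500 fails, no annotation
+]]></programlisting>
+      </para>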
+      <para>
+        To solve this, use a block to tie the
+        execution of these rules
+        together for each Paragraph:
+        <programlisting><![CDATA[DECLARE BigParagraph;
+INT numberOfWords;
+BLOCK(IsBig) Paragraph{} {
+  Document{COUNT(W,numberOfWords)};
+  Document{IF(numberOfWords > 500) -> MARK(BigParagraph)};
+}
+]]></programlisting>
+        Since the scope of the Document is limited to a Paragraph within
+        the
+        block, the rule which counts the words is only executed once
+        before
+        the second rule decides if the Paragraph is a BigParagraph.
+        Of course
+        this is done for every Paragraph in the whole document.
+      </para>
+    </section>
+    <section id="ugr.tools.tm.language.blocks.procedure">
+      <title>Procedures</title>
+      <para>
+        Blocks can be used to introduce procedures into the TextMarker
+        language. To do this, declare a block as before. Let's assume you
+        want to simulate a procedure
+        <programlisting><![CDATA[public int countAmountOfTypesInDocument(Type type){
+    int amount = 0;
+    for(Token token : Document) {
+      if(token.isType(type)){
+        amount++;
+      }
+    }
+    return amount;
+} 
+
+public static void main() {
+  int amount = countAmountOfTypesInDocument(Paragraph);
+}            
+]]></programlisting>
+        which counts the annotations of the passed type within the document
+        and returns the count. This can be done in the following
+        way:
+        <programlisting><![CDATA[BOOLEAN executeProcedure = false;
+TYPE type;
+INT amount;
 
-
-			A block can be used to execute the containing rule on a sequence of
-			similar text passages.
-
-			Examples:
-			<programlisting><![CDATA[
-BLOCK(Paragraphs) Paragraphs{} { // <- limit the local view on the document: defines a local document
-    // This rule will be executed for each Paragraph that can be found in the current document.
-    Document{CONTAINS(Keyword)->MARK(SpecialParagraph)}; 
-    // Here, Document represents not the complete input document, but each Paragraph defined by the block statement.
+BLOCK(countNumberOfTypesInDocument) Document{IF(executeProcedure)} {
+    Document{COUNT(type, amount)};
 }
+
+Document{->ASSIGN(executeProcedure, true)};
+Document{->ASSIGN(type, Paragraph)};
+Document{->CALL(MyScript.countNumberOfTypesInDocument)};
 ]]></programlisting>
-		</para>
-	</section>
-	<section id="ugr.tools.tm.language.score">
-		<title>Heuristic extraction using scoring rules</title>
-		<para>
-			Diagnostic scores are a well known and successfully applied
-			knowledge
-			formalization pattern for diagnostic problems. Single known
-			findings
-			valuate a possible solution by adding or subtracting points
-			on an
-			account of that solution. If the sum exceeds a given threshold,
-			then
-			the solution is derived. One of the advantages of this pattern
-			is the
-			robustness against missing or false findings, since a high
-			number of
-			findings is used to derive a solution.
-
-			The TextMarker system
-			tries to
-			transfer this diagnostic problem
-			solution
-			strategy to the
-			information
-			extraction problem. In addition to a
-			normal creation of a
-			new
-			annotation, a MARK action can add positive
-			or negative scoring
-			points
-			to the text fragments matched by the rule
-			elements. If the
-			amount of
-			points exceeds the defined threshold for
-			the respective
-			type, then a
-			new annotation will be created. Further,
-			the current
-			value of heuristic
-			points of a possible annotation can
-			be
-			evaluated by
-			the SCORE condition.
-			In the following, the heuristic
-			extraction using
-			scoring rules is
-			demonstrated by a short example:
-
-			<programlisting><![CDATA[
-            Paragraph{CONTAINS(W,1,5)->MARKSCORE(5,Headline)};
-            Paragraph{CONTAINS(W,6,10)->MARKSCORE(2,Headline)};
-            Paragraph{CONTAINS(Emph,80,100,true)->MARKSCORE(7,Headline)};
-            Paragraph{CONTAINS(Emph,30,80,true)->MARKSCORE(3,Headline)};
-            Paragraph{CONTAINS(CW,50,100,true)->MARKSCORE(7,Headline)};
-            Paragraph{CONTAINS(W,0,0)->MARKSCORE(-50,Headline)};
-            Headline{SCORE(10)->MARK(Realhl)};
-            Headline{SCORE(5,10)->LOG("Maybe a headline")};
-                ]]></programlisting>
-
-
-			In the first part of this rule set, annotations of the type
-			paragraph
-			receive scoring points for a headline annotation, if they
-			fulfill
-			certain CONTAINS conditions. The first condition, for
-			example,
-			evaluates to true, if the paragraph contains one word up to
-			five
-			words, whereas the fourth conditions is fulfilled, if the
-			paragraph
-			contains thirty up to eighty percent of emph annotations.
-			The last
-			two
-			rules finally execute their actions, if the score of a
-			headline
-			annotation exceeds ten points, or lies in the interval of
-			five and
-			ten
-			points, respectively.
-		</para>
-	</section>
-	<section id="ugr.tools.tm.language.modification">
-		<title>Modification</title>
-		<para>
-			There are different actions that can modify the input document,
-			like DEL,
-			COLOR and REPLACE. But the input document itself can not be
-			modified
-			directly. A seperate engine, the Modifier.xml, has to be
-			called in
-			order to create another cas view with the name "modified".
-			In that
-			document all modifications are executed.
-		</para>
-	</section>
+        The boolean variable
+        <literal>executeProcedure</literal>
+        prevents the execution of the block when the interpreter first
+        reaches its declaration, since this is not a procedure call. The
+        block can be called by referring to its name, preceded by the name
+        of the script in which the block is defined. In this example, the
+        script is called MyScript.tm.
+      </para>
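+      <para>
+        As a sketch of how such a procedure could be reused, the following
+        lines call the block a second time for another type. The type
+        <literal>Sentence</literal>
+        and the variable names
+        <literal>paragraphCount</literal>
+        and
+        <literal>sentenceCount</literal>
+        are assumptions made for this illustration, and the script is again
+        assumed to be MyScript.tm.
+        <programlisting><![CDATA[INT paragraphCount;
+INT sentenceCount;
+
+// keep the result of the call shown above before it is overwritten
+Document{->ASSIGN(paragraphCount, amount)};
+
+// second call: pass another type and store its result separately
+Document{->ASSIGN(type, Sentence)};
+Document{->CALL(MyScript.countNumberOfTypesInDocument)};
+Document{->ASSIGN(sentenceCount, amount)};
+]]></programlisting>
+        Because
+        <literal>type</literal>
+        and
+        <literal>amount</literal>
+        are shared script variables, each result has to be copied before
+        the next call overwrites it.
+      </para>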
+    </section>
+
+  </section>
+  <section id="ugr.tools.tm.language.score">
+    <title>Heuristic extraction using scoring rules</title>
+    <para>
+      Diagnostic scores are a well-known and successfully applied
+      knowledge formalization pattern for diagnostic problems. Individual
+      findings rate a possible solution by adding points to or subtracting
+      points from an account of that solution. If the sum exceeds a given
+      threshold, the solution is derived. One advantage of this pattern
+      is its robustness against missing or false findings, since a high
+      number of findings is used to derive a solution.
+
+      The TextMarker system tries to transfer this diagnostic
+      problem-solving strategy to the information extraction problem.
+      In addition to the normal creation of a new annotation, a MARKSCORE
+      action can add positive or negative scoring points to the text
+      fragments matched by the rule elements. The current score of an
+      annotation can be evaluated by the SCORE condition, which can be used
+      in an additional rule to create another annotation.
+      The following short example demonstrates heuristic extraction using
+      scoring rules:
+
+      <programlisting><![CDATA[Paragraph{CONTAINS(W,1,5)->MARKSCORE(5,Headline)};
+Paragraph{CONTAINS(W,6,10)->MARKSCORE(2,Headline)};
+Paragraph{CONTAINS(Emph,80,100,true)->MARKSCORE(7,Headline)};
+Paragraph{CONTAINS(Emph,30,80,true)->MARKSCORE(3,Headline)};
+Paragraph{CONTAINS(CW,50,100,true)->MARKSCORE(7,Headline)};
+Paragraph{CONTAINS(W,0,0)->MARKSCORE(-50,Headline)};
+Headline{SCORE(10)->MARK(Realhl)};
+Headline{SCORE(5,10)->LOG("Maybe a headline")};]]></programlisting>
+
+
+      In the first part of this rule set, annotations of the type
+      Paragraph receive scoring points for a Headline annotation if they
+      fulfill certain CONTAINS conditions. The first condition, for
+      example, evaluates to true if the paragraph contains one to
+      five words, whereas the fourth condition is fulfilled if the
+      paragraph contains thirty to eighty percent Emph annotations.
+      The last two rules finally execute their actions if the score of a
+      Headline annotation exceeds ten points, or lies in the interval from
+      five to ten points, respectively.
+    </para>
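+    <para>
+      As a worked illustration, assume a hypothetical paragraph of four
+      words for which the Emph ratio evaluates to 90 percent and the CW
+      ratio to 25 percent. These values are made up for this sketch. The
+      comments trace how the rule set above would then score it:
+      <programlisting><![CDATA[// CONTAINS(W,1,5)             true  -> +5
+// CONTAINS(W,6,10)            false
+// CONTAINS(Emph,80,100,true)  true  -> +7
+// CONTAINS(Emph,30,80,true)   false
+// CONTAINS(CW,50,100,true)    false
+// CONTAINS(W,0,0)             false
+// resulting Headline score: 5 + 7 = 12
+// SCORE(10)   is fulfilled     -> Realhl is created
+// SCORE(5,10) is not fulfilled -> nothing is logged
+]]></programlisting>
+    </para>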
+  </section>
+  <section id="ugr.tools.tm.language.modification">
+    <title>Modification</title>
+    <para>
+      There are different actions that can modify the input document,
+      like DEL, COLOR and REPLACE. However, the input document itself
+      cannot be modified directly. A separate engine, Modifier.xml,
+      has to be called in order to create another CAS view with the name
+      "modified". All modifications are executed in that view.
+    </para>
+    <para>
+      The following example shows how to import and call the Modifier.xml
+      engine.
+      The example is explained in detail in
+      <xref linkend='ugr.tools.tm.overview.examples' />
+      .
+    </para>
+    <programlisting><![CDATA[ENGINE utils.Modifier;
+Date{-> DEL};
+MoneyAmount{-> REPLACE("<MoneyAmount/>")};
+Document{-> COLOR(Headline, "green")};
+Document{-> EXEC(Modifier)};
+]]></programlisting>
+
+    <para>
+      To get to the modified view of an input document
+      <quote>file1.txt</quote>,
+      open the output document
+      <quote>file1.txt.xmi</quote>.
+      Then right-click in the editor and choose
+      <quote>CAS Views &rarr; modified</quote>.
+    </para>
+  </section>
 </chapter>
\ No newline at end of file

Modified: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml?rev=1415605&r1=1415604&r2=1415605&view=diff
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml (original)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.overview.xml Fri Nov 30 12:51:25 2012
@@ -33,6 +33,8 @@ under the License.
     <para>
       - rule-based script language
       - imperative execution
+      - rapid prototyping
+      - intuitive and flexible, fast development
       - extensive tooling support, writing rules is tedious, needs support
       - all about UIMA, AE, TypeSystem... 
     </para>
@@ -48,6 +50,7 @@ under the License.
       - read about the core concepts of TextMarker and take a look at the language examples
       - read the chapter about language for a precise description of the language, rather a reference book
       - workbench chapter for documentation of eclipse-based tooling
+      - example project in SVN
     </para>
   </section>
   
@@ -77,20 +80,20 @@ under the License.
       annotations added by the default seeding of the TextMarker Analysis Engine. There meaning is explained along with the examples. 
     </para>
     <note><para>
-      The examples in this section are not valid script files as they missing at least a package declaration. 
+      The examples in this section are not valid script files as they are missing at least a package declaration. 
       In order to obtain a valid script file, please ensure that all used types are imported or declared and 
       that a package declaration like <quote>PACKAGE uima.textmarker.example;</quote> is added in the first line of the script.
     </para></note>
     <para>
-      The first example consists of a declaration of a type followed by a simple rule. Type declaration always start with the keyword 
+      The first example consists of a declaration of a type followed by a simple rule. A type declaration always starts with the keyword 
       <quote>DECLARE</quote> followed by the short name of the new type. The namespace of the type is equal to the package declaration of the script file.
       There is also the possibility to create more complex types with features or specific parent types, but this will be neglected for now.
       In the example, a simple annotation type with the short name <quote>Animal</quote> is defined.
       After the declaration of the type, a rule with one rule element is given. 
-      TextMarker rules in general  can consist of a sequence of rule elements. Simple rule elements themselves consist of four parts: A matching condition,
+      TextMarker rules in general can consist of a sequence of rule elements. Simple rule elements themselves consist of four parts: A matching condition,
       an optional quantifier, an optional list of conditions and an optional list of actions. The rule element in the 
       following example has a matching condition <quote>W</quote>, an annotation type standing for normal words. 
-      Statements like declarations and rule always end with a semicolon. 
+      Statements like declarations and rules always end with a semicolon. 
     </para>
     
     <programlisting><![CDATA[DECLARE Animal;
@@ -98,7 +101,7 @@ W{REGEXP("dog") -> MARK(Animal)};]]></pr
 
     <para>
       The rule element also contains one condition and one action, both surrounded by curly parentheses. In order to distinguish conditions from actions,
-      they are separated by the <quote>-></quote>. The condition <quote>REGEXP("dog")</quote> indicated that the matched 
+      they are separated by the <quote>-></quote>. The condition <quote>REGEXP("dog")</quote> indicates that the matched 
       word must match the regular expression <quote>dog</quote>. If the matching condition and the additional regular expression are fulfilled, then the action
       is executed, which creates a new annotation of the type <quote>Animal</quote> with the same offsets as the matched token.
       The default seeder does actually not add annotations of the type <quote>W</quote>, but annotations of the types <quote>SW</quote> and