Posted to commits@uima.apache.org by pk...@apache.org on 2011/12/20 16:32:17 UTC
svn commit: r1221318 - in
/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker: pom.xml
src/docbook/images/tools/tools.textmarker/symboltaxo.png
src/docbook/proxy-book.xml src/docbook/tools.textmarker.xml
Author: pkluegl
Date: Tue Dec 20 15:32:17 2011
New Revision: 1221318
URL: http://svn.apache.org/viewvc?rev=1221318&view=rev
Log:
UIMA-2285
converted to maven project
added a proxy book and old (out-dated) introduction for testing the maven build process
Added:
uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml
uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tools.textmarker/symboltaxo.png (with props)
uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml
Modified:
uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml
Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml?rev=1221318&view=auto
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml (added)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/pom.xml Tue Dec 20 15:32:17 2011
@@ -0,0 +1,23 @@
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
+ <modelVersion>4.0.0</modelVersion>
+ <artifactId>uima-docbook-textmarker</artifactId>
+ <version>2.4.1-SNAPSHOT</version>
+ <packaging>pom</packaging>
+ <parent>
+ <groupId>org.apache.uima</groupId>
+ <artifactId>uimaj-parent</artifactId>
+ <version>2.4.1-SNAPSHOT</version>
+ <relativePath>../uimaj-parent/pom.xml</relativePath>
+ </parent>
+ <name>Apache UIMA SDK Documentation - TextMarker</name>
+ <url>${uimaWebsiteUrl}</url>
+ <scm>
+ <url>http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker</url>
+ <connection>scm:svn:http://svn.apache.org/repos/asf/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker</connection>
+ <developerConnection>scm:svn:https://svn.apache.org/repos/asf/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker</developerConnection>
+ </scm>
+ <properties>
+ <uimaScmProject>${project.artifactId}</uimaScmProject>
+ <bookNameRoot>proxy-book</bookNameRoot>
+ </properties>
+</project>
\ No newline at end of file
Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tools.textmarker/symboltaxo.png
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tools.textmarker/symboltaxo.png?rev=1221318&view=auto
==============================================================================
Binary file - no diff available.
Propchange: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/images/tools/tools.textmarker/symboltaxo.png
------------------------------------------------------------------------------
svn:mime-type = application/octet-stream
Added: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml?rev=1221318&view=auto
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml (added)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/proxy-book.xml Tue Dec 20 15:32:17 2011
@@ -0,0 +1,27 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN"
+"http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd">
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+<book lang="en">
+ <title>TextMarker Guide and Reference</title>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="../../target/docbook-shared/common_book_info.xml"/>
+ <toc/>
+ <xi:include xmlns:xi="http://www.w3.org/2001/XInclude" href="tools.textmarker.xml"/>
+</book>
Modified: uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml
URL: http://svn.apache.org/viewvc/uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml?rev=1221318&r1=1221317&r2=1221318&view=diff
==============================================================================
--- uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml (original)
+++ uima/sandbox/trunk/TextMarker/uima-docbook-textmarker/src/docbook/tools.textmarker.xml Tue Dec 20 15:32:17 2011
@@ -25,4 +25,227 @@ under the License.
-->
<chapter id="ugr.tools.tm">
+ <title>TextMarker User's Guide</title>
+ <titleabbrev>TextMarker User's Guide</titleabbrev>
+
+ <section id="ugr.tools.tm.introduction">
+ <title>TextMarker</title>
+ <para>The TextMarker system is a rule-based tool for information
+ extraction and text processing tasks. Its rule language is
+ designed
+ to be readable, can be easily extended, and supports several
+ scripting features.
+ TextMarker provides a DLTK-based IDE as well as integration
+ and a build
+ process for UIMA components.
+ </para>
+ <section id="ugr.tools.tm.introduction.metaphor">
+ <title>Introduction</title>
+ <para>
+ In manual information extraction, humans often apply a strategy
+ according to a highlighter metaphor: First, relevant headlines are
+ considered and classified according to their content by coloring
+ them
+ with different highlighters. The paragraphs of the annotated
+ headlines
+ are then considered further. Relevant text fragments or
+ single words
+ in the context of that headline can then be colored. In
+ this way, a
+ top-down analysis and extraction strategy is implemented.
+ Necessary
+ additional information can then be added that either refers
+ to other
+ text segments or contains valuable domain-specific
+ information.
+ Finally, the colored text can be easily analyzed
+ concerning the
+ relevant information.
+
+ The TextMarker system (textmarker
+ is a common German word for a
+ highlighter) tries to imitate this
+ manual extraction method by
+ formalizing the appropriate actions using
+ matching rules: The rules
+ mark sequences of words, extract text
+ segments or modify the input
+ document depending on textual
+ features. The default input for the
+ TextMarker system is
+ semi-structured text, but it can also process
+ structured or free
+ text. Technically, HTML is often the input
+ format,
+ since most word
+ processing documents can be converted to HTML.
+ Additionally, the
+ TextMarker system offers the possibility to
+ create
+ a modified output
+ document.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.concepts">
+ <title>Core Concepts</title>
+ <para>
+ As a first step in the extraction process the TextMarker system uses
+ a
+ tokenizer (scanner) to tokenize the input document and to create a
+ stream of basic symbols. The types and valid annotations of the
+ possible tokens are predefined by a taxonomy of annotation types.
+ Annotations simply refer to a section of the input document and
+ assign a type or concept to the respective text fragment. The figure
+ below shows an excerpt of a basic annotation taxonomy: CW, for
+ example, describes all tokens that contain a single word
+ starting with a capital letter, MARKUP corresponds to HTML or XML
+ tags, and PM refers to all kinds of punctuation marks. See the
+ basic annotation list (BasicAnnotationList) for a complete list of
+ initial annotations.
+
+
+ <screenshot>
+ <mediaobject>
+ <imageobject>
+ <imagedata scale="100" format="PNG" fileref="&imgroot;symboltaxo.png" />
+ </imageobject>
+ <textobject>
+ <phrase>Part of a taxonomy for basic annotation types.</phrase>
+ </textobject>
+ </mediaobject>
+ </screenshot>
+
+ By using (and extending) the taxonomy, the knowledge engineer is
+ able
+ to choose the most adequate types and concepts when defining new
+ matching rules, i.e., TextMarker rules for matching a text fragment
+ given by a set of symbols to an annotation. If the capitalization of
+ a word, for example, is of no importance, then the annotation type W
+ that describes words of any kind can be used. The initial scanner
+ creates a set of basic annotations that may be used by the matching
+ rules of the TextMarker language. However, most information
+ extraction applications require domain-specific concepts and
+ annotations. Therefore, the knowledge engineer is able to extend the
+ set of annotations, and to define new annotation types tuned to the
+ requirements of the given domain. These types can be flexibly
+ integrated in the taxonomy of annotation types.
+
+ One of the goals in
+ developing a new information extraction language
+ was
+ to maintain an
+ easily readable syntax while still providing a
+ scalable
+ expressiveness of the language. Basically, the TextMarker
+ language
+ contains expressions for the definition of new annotation
+ types and
+ for defining new matching rules. The rules are defined by a
+ list of
+ rule elements.
+ Each rule element contains at least a basic matching
+ condition referring
+ to text fragments or already specified
+ annotations. Additionally, a
+ list of conditions and actions may be
+ specified for a rule element.
+ Whereas the conditions describe
+ necessary attributes of the matched
+ text fragment, the actions point
+ to operations and assignments on
+ the
+ current fragments. These actions
+ are only executed if all
+ basic conditions matched a text
+ fragment or annotation and the
+ related conditions are fulfilled.
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.examples">
+ <title>Examples</title>
+ <para>
+ The usage of the language and its readability can be demonstrated by
+ simple examples:
+
+ <programlisting>
+ CW{INLIST('animals.txt') -> MARK(Animal)};
+ Animal "and" Animal{-> MARK(Animalpair, 1, 2, 3)};
+ </programlisting>
+
+ The first rule looks at all capitalized words that are listed in an
+ external document animals.txt and creates a new annotation of the
+ type
+ Animal using the boundaries of the matched word. The second rule
+ searches for an annotation of the type Animal followed by the
+ literal
+ "and" and a second Animal annotation. It then creates a new
+ annotation Animalpair covering the text segment that matched the
+ three
+ rule elements (the numeric parameters refer to the positions of the
+ matched
+ rule elements).
+
+ <programlisting>
+ Document{-> MARKFAST(Firstname, 'firstnames.txt')};
+ Firstname CW{-> MARK(Lastname)};
+ Paragraph{VOTE(Firstname, Lastname) -> LOG("Found more Firstnames than Lastnames")};
+ </programlisting>
+
+ In this example, the first rule annotates all words that occur in
+ the
+ external document firstnames.txt with the type Firstname. The
+ second
+ rule creates a Lastname annotation for all capitalized words
+ that
+ follow a Firstname annotation. Finally, the last rule processes
+ all
+ Paragraph annotations. If the VOTE condition counts more
+ Firstname
+ than Lastname annotations, then the rule writes a log entry
+ with a
+ predefined message.
+
+
+ <programlisting>
+ ANY+{PARTOF(Paragraph), CONTAINS(Delete, 50, 100, true) -> MARK(Delete)};
+ Firstname{-> MARK(Delete, 1, 2)} Lastname;
+ Delete{-> DEL};
+ </programlisting>
+
+ Here, the first rule looks for sequences of any kind of tokens
+ except
+ markup and creates one annotation of the type Delete for each
+ sequence, if the tokens are part of a Paragraph annotation and
+ together already contain more than 50% Delete annotations. The +
+ sign indicates this greedy processing. The second rule annotates
+ first
+ names followed by last names with the type Delete and the third
+ rule
+ simply deletes all text segments that are associated with that
+ Delete
+ annotation.
+
+ </para>
+ </section>
+ <section id="ugr.tools.tm.introduction.features">
+ <title>Special Features</title>
+ <para>
+ The TextMarker language has some special features
+ that are
+ usually not found in other rule-based information extraction
+ systems
+ and that even shift it towards scripting languages. The possibility
+ of
+ creating new annotation types and integrating them into the
+ taxonomy
+ facilitates an even more modular development of information
+ extraction systems.
+
+ Read more about robust extraction using
+ filtering, complex control
+ structures and heuristic extraction using
+ scoring rules.
+ </para>
+ </section>
+ </section>
</chapter>
\ No newline at end of file