Posted to commits@stanbol.apache.org by rw...@apache.org on 2011/06/30 19:03:31 UTC

svn commit: r1141623 - in /incubator/stanbol/trunk: commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/ enhancer/ enhancer/engines/entitytagging/src/main/java/org/apache/stanbol/enhancer/engines/entitytagging/impl/ en...

Author: rwesten
Date: Thu Jun 30 17:03:30 2011
New Revision: 1141623

URL: http://svn.apache.org/viewvc?rev=1141623&view=rev
Log:
STANBOL-245: First version of the TaxonomyLinkingEngine

Parses the content provided by the ContentItem and tokenizes it using an OpenNLP Tokenizer. If a Sentence Detector is available, the text is processed sentence by sentence; otherwise the whole text is processed at once. If a POS tagger is available, lookups are performed only for nouns. Creating chunks based on POS tags is also already implemented; if a Chunker is available, it is used instead of the POS-tag-based chunks.
Entity lookup is implemented on top of the ReferencedSite interface. Lookups are not restricted by type, but the types of the results are used to determine the type of the extracted words.
This engine also has experimental support for following rdf:seeAlso links, which are often used to represent redirects within RDF data.
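The noun-based lookup selection described above can be sketched as follows. This is a minimal, stdlib-only illustration: the whitespace tokenizer and the hardcoded tag map stand in for the OpenNLP Tokenizer and POS tagger, and the class and method names are chosen here for illustration, not taken from the engine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/**
 * Sketch of the lookup selection: tokenize the text and, when POS tags
 * are available, keep only nouns ("NN*" in the Penn tag set) as
 * candidates for taxonomy lookups.
 */
public class NounLookupSketch {

    // stand-in for an OpenNLP POS tagger (illustrative, hardcoded)
    static final Map<String, String> POS_TAGS = Map.of(
        "Paris", "NNP", "is", "VBZ", "a", "DT", "city", "NN");

    static List<String> lookupCandidates(String text) {
        List<String> candidates = new ArrayList<>();
        for (String token : text.split("\\s+")) {
            String tag = POS_TAGS.get(token);
            // only nouns are used for taxonomy lookups
            if (tag != null && tag.startsWith("NN")) {
                candidates.add(token);
            }
        }
        return candidates;
    }

    public static void main(String[] args) {
        System.out.println(lookupCandidates("Paris is a city")); // prints [Paris, city]
    }
}
```

The real engine additionally groups nouns into chunks (via POS tags or a Chunker) before querying the ReferencedSite.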

Open issues:

* Language support: Currently "en" is hardcoded, but the implementation would already support other languages.
* Code cleanup: The opennlp-ner engine shares some functionality with this one, as does the EntityTagging engine. OpenNLP-specific code could be moved to the commons.opennlp bundle. A place still needs to be found for the Entityhub-related code that is useful for all engines using the Entityhub.
* Integration tests are missing
* Finding a taxonomy to demonstrate this engine: IPTC could be used, but would rather be something for a DocumentCategorizationEngine than for this one. Domain-specific thesauri would be good candidates. Any ideas?

Other changes:

* extended the ContentItemResource with two additional categories, "Concepts" and "Others", in addition to "Persons", "Organizations" and "Places". "Concepts" is linked to skos:Concept; "Others" holds all TextAnnotations without any dc:type value
* added skos:Concept to the OntologicalClasses list because the TaxonomyLinkingEngine uses this as the dc:type value in case a TextAnnotation stands for a concept of a taxonomy

Note:

* as long as the "org.apache.stanbol.defaultdata" bundle version 0.0.3 is not available, this engine will not be able to use the English POS tagger or chunker. As a workaround, users can download these files from http://opennlp.sourceforge.net/models-1.5/ and copy them into the "{stanbolhome}/sling/datafiles" folder.
* this engine is not yet added to the full launcher. To try it, install the bundle in a running Stanbol instance
* this engine does not provide a default configuration. Users will need to create a configuration by setting at least the "Referenced Site" in the configuration tab of the Apache Felix Webconsole. As soon as this engine is included in the full launcher, it will be pre-configured with a default taxonomy provided by a referenced site.
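Configuring the engine in the Webconsole boils down to an OSGi configuration with at least the referenced site set. A hypothetical example follows; the property key is an illustrative guess modeled after the entitytagging engine's `referencedSiteId` property and is not confirmed by this commit:

```properties
# hypothetical configuration for the TaxonomyLinkingEngine
# (the key is an assumption; only "Referenced Site" is known to be required)
org.apache.stanbol.enhancer.engines.taxonomy.referencedSiteId=myTaxonomySite
```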

Added:
    incubator/stanbol/trunk/commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/black_gear_48.png   (with props)
    incubator/stanbol/trunk/commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/unknown_48.png   (with props)
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/pom.xml   (with props)
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/EnhancementRDFUtils.java   (with props)
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/Suggestion.java   (with props)
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/TaxonomyLinkingEngine.java   (with props)
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/resources/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/resources/OSGI-INF/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/resources/OSGI-INF/metatype/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/test/
    incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/test/java/
Removed:
    incubator/stanbol/trunk/enhancer/engines/entitytagging/src/main/java/org/apache/stanbol/enhancer/engines/entitytagging/impl/LabelBasedEntityTaggingEngine.java
Modified:
    incubator/stanbol/trunk/enhancer/engines/entitytagging/src/main/resources/OSGI-INF/metatype/metatype.properties
    incubator/stanbol/trunk/enhancer/generic/servicesapi/src/main/java/org/apache/stanbol/enhancer/servicesapi/rdf/OntologicalClasses.java
    incubator/stanbol/trunk/enhancer/jersey/src/main/java/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource.java
    incubator/stanbol/trunk/enhancer/jersey/src/main/resources/org/apache/stanbol/enhancer/jersey/templates/imports/contentitem.ftl
    incubator/stanbol/trunk/enhancer/pom.xml

Added: incubator/stanbol/trunk/commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/black_gear_48.png
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/black_gear_48.png?rev=1141623&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/stanbol/trunk/commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/black_gear_48.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Added: incubator/stanbol/trunk/commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/unknown_48.png
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/unknown_48.png?rev=1141623&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/stanbol/trunk/commons/web/home/src/main/resources/org/apache/stanbol/commons/web/home/static/images/unknown_48.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: incubator/stanbol/trunk/enhancer/engines/entitytagging/src/main/resources/OSGI-INF/metatype/metatype.properties
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/entitytagging/src/main/resources/OSGI-INF/metatype/metatype.properties?rev=1141623&r1=1141622&r2=1141623&view=diff
==============================================================================
--- incubator/stanbol/trunk/enhancer/engines/entitytagging/src/main/resources/OSGI-INF/metatype/metatype.properties (original)
+++ incubator/stanbol/trunk/enhancer/engines/entitytagging/src/main/resources/OSGI-INF/metatype/metatype.properties Thu Jun 30 17:03:30 2011
@@ -1,7 +1,7 @@
 #===============================================================================
 #Properties and Options used to configure ReferencedSiteEntityTaggingEnhancementEngine
 #===============================================================================
-org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine.name=Named Entity Tagging Engine
+org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine.name=Apache Stanbol Enhancement Engine for Named Entity Tagging
 org.apache.stanbol.enhancer.engines.entitytagging.impl.NamedEntityTaggingEngine.description=Links named entities (Persons, Organisations, Places) to Entities managed by an Entityhub Referenced Site
 
 org.apache.stanbol.enhancer.engines.entitytagging.referencedSiteId.name=Referenced Site

Added: incubator/stanbol/trunk/enhancer/engines/taxonomylinking/pom.xml
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/pom.xml?rev=1141623&view=auto
==============================================================================
--- incubator/stanbol/trunk/enhancer/engines/taxonomylinking/pom.xml (added)
+++ incubator/stanbol/trunk/enhancer/engines/taxonomylinking/pom.xml Thu Jun 30 17:03:30 2011
@@ -0,0 +1,147 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<!--
+   Licensed to the Apache Software Foundation (ASF) under one or more
+   contributor license agreements.  See the NOTICE file distributed with
+   this work for additional information regarding copyright ownership.
+   The ASF licenses this file to You under the Apache License, Version 2.0
+   (the "License"); you may not use this file except in compliance with
+   the License.  You may obtain a copy of the License at
+
+        http://www.apache.org/licenses/LICENSE-2.0
+
+   Unless required by applicable law or agreed to in writing, software
+   distributed under the License is distributed on an "AS IS" BASIS,
+   WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+   See the License for the specific language governing permissions and
+   limitations under the License.
+-->
+<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
+         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/maven-v4_0_0.xsd">
+
+  <modelVersion>4.0.0</modelVersion>
+
+  <parent>
+    <groupId>org.apache.stanbol</groupId>
+    <artifactId>org.apache.stanbol.enhancer.parent</artifactId>
+    <version>0.9-SNAPSHOT</version>
+    <relativePath>../../parent</relativePath>
+  </parent>
+
+  <groupId>org.apache.stanbol</groupId>
+  <artifactId>org.apache.stanbol.enhancer.engine.taxonomy</artifactId>
+  <packaging>bundle</packaging>
+
+  <name>Apache Stanbol Enhancer Enhancement Engine : Taxonomy Linking Engine </name>
+  <description>
+    Implementation of an annotation engine that uses a referenced site of the
+    Entityhub as Taxonomy for searching Entities mentioned within the parsed 
+    ContentItem.
+    This engine uses OpenNLP Tokenizers and optionally a POS tagger and Chunker
+    to extract words that may represent Entities that are part of the Taxonomy.
+    NOTE: This engine expects that the used Referenced Sites hold a Taxonomy. When
+    used with datasets such as Wikipedia it will deliver a lot of Enhancements. In
+    such cases users might want to apply additional filters on the results.
+  </description>
+
+  <inceptionYear>2010</inceptionYear>
+
+  <scm>
+    <connection>
+      scm:svn:http://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomy/
+    </connection>
+    <developerConnection>
+      scm:svn:https://svn.apache.org/repos/asf/incubator/stanbol/trunk/enhancer/engines/taxonomy/
+    </developerConnection>
+    <url>http://incubator.apache.org/stanbol/</url>
+  </scm>
+
+  <build>
+    <plugins>
+      <plugin>
+        <groupId>org.apache.felix</groupId>
+        <artifactId>maven-bundle-plugin</artifactId>
+        <extensions>true</extensions>
+        <configuration>
+          <instructions>
+            <Export-Package>
+              org.apache.stanbol.enhancer.engines.taxonomy;version=${pom.version}
+            </Export-Package>
+            <Private-Package>
+              org.apache.stanbol.enhancer.engines.taxonomy.impl.*
+            </Private-Package>
+            <!-- TODO those should be bundles! -->
+            <Embed-Dependency>
+            </Embed-Dependency>
+          </instructions>
+        </configuration>
+      </plugin>
+      <plugin>
+        <groupId>org.apache.felix</groupId>
+        <artifactId>maven-scr-plugin</artifactId>
+      </plugin>
+    </plugins>
+  </build>
+
+  <dependencies>
+    <dependency>
+      <groupId>org.apache.stanbol</groupId>
+      <artifactId>org.apache.stanbol.enhancer.servicesapi</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.stanbol</groupId>
+      <artifactId>org.apache.stanbol.commons.stanboltools.offline</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.stanbol</groupId>
+      <artifactId>org.apache.stanbol.entityhub.servicesapi</artifactId>
+      <scope>compile</scope>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.stanbol</groupId>
+      <artifactId>org.apache.stanbol.entityhub.model.clerezza</artifactId>
+      <scope>compile</scope>
+    </dependency>
+
+    <dependency>
+	  <groupId>org.apache.stanbol</groupId>
+	  <artifactId>org.apache.stanbol.commons.opennlp</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.stanbol</groupId>
+      <artifactId>org.apache.stanbol.defaultdata</artifactId>
+    </dependency>
+
+    <dependency>
+      <groupId>commons-io</groupId>
+      <artifactId>commons-io</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>commons-lang</groupId>
+      <artifactId>commons-lang</artifactId>
+    </dependency>
+
+    <dependency>
+      <groupId>org.apache.felix</groupId>
+      <artifactId>org.apache.felix.scr.annotations</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.apache.clerezza</groupId>
+      <artifactId>org.apache.clerezza.rdf.core</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-api</artifactId>
+    </dependency>
+
+    <!-- Testing -->
+    <dependency>
+      <groupId>junit</groupId>
+      <artifactId>junit</artifactId>
+    </dependency>
+    <dependency>
+      <groupId>org.slf4j</groupId>
+      <artifactId>slf4j-simple</artifactId>
+    </dependency>
+  </dependencies>
+
+</project>

Propchange: incubator/stanbol/trunk/enhancer/engines/taxonomylinking/pom.xml
------------------------------------------------------------------------------
    svn:mime-type = text/plain

Added: incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/EnhancementRDFUtils.java
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/EnhancementRDFUtils.java?rev=1141623&view=auto
==============================================================================
--- incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/EnhancementRDFUtils.java (added)
+++ incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/EnhancementRDFUtils.java Thu Jun 30 17:03:30 2011
@@ -0,0 +1,140 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.stanbol.enhancer.engines.taxonomy.impl;
+
+import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.DC_RELATION;
+import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.ENHANCER_CONFIDENCE;
+import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.ENHANCER_ENTITY_LABEL;
+import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.ENHANCER_ENTITY_REFERENCE;
+import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.ENHANCER_ENTITY_TYPE;
+import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.RDF_TYPE;
+
+import java.util.Collection;
+import java.util.Iterator;
+
+import org.apache.clerezza.rdf.core.Language;
+import org.apache.clerezza.rdf.core.Literal;
+import org.apache.clerezza.rdf.core.LiteralFactory;
+import org.apache.clerezza.rdf.core.MGraph;
+import org.apache.clerezza.rdf.core.NonLiteral;
+import org.apache.clerezza.rdf.core.UriRef;
+import org.apache.clerezza.rdf.core.impl.PlainLiteralImpl;
+import org.apache.clerezza.rdf.core.impl.TripleImpl;
+import org.apache.stanbol.enhancer.servicesapi.ContentItem;
+import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;
+import org.apache.stanbol.enhancer.servicesapi.helper.EnhancementEngineHelper;
+import org.apache.stanbol.entityhub.servicesapi.model.Reference;
+import org.apache.stanbol.entityhub.servicesapi.model.Entity;
+import org.apache.stanbol.entityhub.servicesapi.model.Representation;
+import org.apache.stanbol.entityhub.servicesapi.model.Text;
+import org.apache.stanbol.entityhub.servicesapi.model.rdf.RdfResourceEnum;
+
+/**
+ * Utility taken from the engine.autotagging bundle and adapted from using TagInfo to {@link Entity}.
+ * 
+ * @author Rupert Westenthaler
+ * @author ogrisel (original utility)
+ */
+public class EnhancementRDFUtils {
+
+    /**
+     * @param literalFactory
+     *            the LiteralFactory to use
+     * @param graph
+     *            the MGraph to use
+     * @param contentItemId
+     *            the contentItemId the enhancement is extracted from
+     * @param relatedEnhancements
+     *            enhancements this textAnnotation is related to
+     * @param entity
+     *            the related entity
+     * @param nameField the field used to extract the name
+     */
+    public static UriRef writeEntityAnnotation(EnhancementEngine engine,
+                                               LiteralFactory literalFactory,
+                                               ContentItem ci,
+                                               Collection<? extends NonLiteral> relatedEnhancements,
+                                               Representation representation,
+                                               String nameField,
+                                               String language) {
+        MGraph graph = ci.getMetadata();
+        UriRef contentItemId = new UriRef(ci.getId());
+        // 1. check if the returned Entity has a label -> if not return null
+        // add labels (set only a single label; use "en" if available!)
+        Text label = null;
+        Iterator<Text> labels = representation.getText(nameField);
+        while (labels.hasNext()) {
+            Text actLabel = labels.next();
+            if (label == null) {
+                label = actLabel;
+            } else if(language != null){
+                String actLang = actLabel.getLanguage();
+                if(actLang != null && actLang.startsWith(language)) {
+                    label = actLabel;
+                }
+            }
+        }
+        if (label == null) {
+            return null;
+        }
+        Literal literal;
+        if (label.getLanguage() == null) {
+            literal = new PlainLiteralImpl(label.getText());
+        } else {
+            literal = new PlainLiteralImpl(label.getText(), new Language(label.getLanguage()));
+        }
+        // Now create the entityAnnotation
+        UriRef entityAnnotation = EnhancementEngineHelper.createEntityEnhancement(graph, engine,
+            contentItemId);
+        // first relate this entity annotation to the text annotation(s)
+        for (NonLiteral enhancement : relatedEnhancements) {
+            graph.add(new TripleImpl(entityAnnotation, DC_RELATION, enhancement));
+        }
+        UriRef entityUri = new UriRef(representation.getId());
+        // add the link to the referred entity
+        graph.add(new TripleImpl(entityAnnotation, ENHANCER_ENTITY_REFERENCE, entityUri));
+        // add the label parsed above
+        graph.add(new TripleImpl(entityAnnotation, ENHANCER_ENTITY_LABEL, literal));
+        // TODO: add real confidence values!
+        // -> in case of SolrYards this will be a Lucene score and not within the range [0..1]
+        // -> in case of SPARQL there will be no score information at all.
+        Object score = representation.getFirst(RdfResourceEnum.resultScore.getUri());
+        Double scoreValue = new Double(-1); // use -1 if no score is available!
+        if (score != null) {
+            try {
+                scoreValue = Double.valueOf(score.toString());
+            } catch (NumberFormatException e) {
+                // ignore
+            }
+        }
+        graph.add(new TripleImpl(entityAnnotation, ENHANCER_CONFIDENCE, literalFactory
+                .createTypedLiteral(scoreValue)));
+
+        Iterator<Reference> types = representation.getReferences(RDF_TYPE.getUnicodeString());
+        while (types.hasNext()) {
+            graph.add(new TripleImpl(entityAnnotation, ENHANCER_ENTITY_TYPE, new UriRef(types.next()
+                    .getReference())));
+        }
+        // TODO: for now add the information about this entity to the graph
+        // -> this might be replaced by some additional engine at the end
+        // RdfValueFactory rdfValueFactory = RdfValueFactory.getInstance();
+        // RdfRepresentation representation = rdfValueFactory.toRdfRepresentation(entity.getRepresentation());
+        // graph.addAll(representation.getRdfGraph());
+        return entityAnnotation;
+    }
+
+}
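The confidence handling in writeEntityAnnotation above falls back to -1 when no usable score is present (as the TODO notes, SolrYards return a raw Lucene score and SPARQL sites return none at all). The defensive parsing can be isolated as a stdlib-only sketch; the class and method names here are chosen for illustration:

```java
/**
 * Mirrors the fallback in writeEntityAnnotation: use -1 when the
 * result score is missing or not parseable as a double.
 */
public class ScoreParseSketch {

    static double parseScore(Object score) {
        if (score == null) {
            return -1d; // no score available
        }
        try {
            return Double.parseDouble(score.toString());
        } catch (NumberFormatException e) {
            return -1d; // unparseable score, treated like a missing one
        }
    }

    public static void main(String[] args) {
        System.out.println(parseScore("0.75")); // prints 0.75
        System.out.println(parseScore(null));   // prints -1.0
        System.out.println(parseScore("n/a"));  // prints -1.0
    }
}
```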

Propchange: incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/EnhancementRDFUtils.java
------------------------------------------------------------------------------
    svn:mime-type = text/plain

Added: incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/Suggestion.java
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/Suggestion.java?rev=1141623&view=auto
==============================================================================
--- incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/Suggestion.java (added)
+++ incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/Suggestion.java Thu Jun 30 17:03:30 2011
@@ -0,0 +1,137 @@
+package org.apache.stanbol.enhancer.engines.taxonomy.impl;
+
+import java.util.ArrayList;
+import java.util.Collections;
+import java.util.HashSet;
+import java.util.List;
+import java.util.Set;
+
+import org.apache.clerezza.rdf.core.UriRef;
+import org.apache.stanbol.enhancer.servicesapi.rdf.OntologicalClasses;
+import org.apache.stanbol.enhancer.servicesapi.rdf.Properties;
+import org.apache.stanbol.entityhub.servicesapi.model.Representation;
+import org.apache.stanbol.entityhub.servicesapi.site.ReferencedSite;
+
+/**
+ * Holds information about suggestions created by the {@link TaxonomyLinkingEngine}.
+ * Used to perform local lookups for queries that would normally be executed
+ * on the {@link ReferencedSite}.
+ * @author Rupert Westenthaler
+ *
+ */
+class Suggestion implements Comparable<Suggestion>{
+    
+    private final UriRef textAnnotation;
+    
+    private final Set<UriRef> textAnnotationTypes;
+    
+    private final Set<UriRef> linkedTextAnnotations;
+    private final Set<UriRef> unmodLinked;
+    
+    private final String searchString;
+
+    private final List<Representation> suggestions;
+    
+    public Suggestion(String searchString,UriRef textAnnotation,List<Representation> suggestions,Set<UriRef> textAnnotationTypes){
+        if(searchString == null || searchString.isEmpty()){
+            throw new IllegalArgumentException("The search string MUST NOT be NULL nor empty");
+        }
+        if(textAnnotation == null){
+            throw new IllegalArgumentException("The parsed UriRef of the textAnnotation MUST NOT be NULL");
+        }
+        if(suggestions == null || suggestions.isEmpty()){
+            throw new IllegalArgumentException("The parsed list of suggestions MUST NOT be NULL nor empty");
+        }
+        if(suggestions.contains(null)){
+            //test for NULL element, because this will cause NPE later on that would
+            //be hard to debug!
+            throw new IllegalArgumentException("The parsed list of suggestions MUST NOT contain the NULL element");
+        }
+        this.searchString = searchString;
+        this.textAnnotation = textAnnotation;
+        this.suggestions = Collections.unmodifiableList(new ArrayList<Representation>(suggestions));
+        this.linkedTextAnnotations = new HashSet<UriRef>();
+        this.unmodLinked = Collections.unmodifiableSet(linkedTextAnnotations);
+        if(textAnnotationTypes == null) {
+            this.textAnnotationTypes = Collections.emptySet();
+        } else {
+            this.textAnnotationTypes = Collections.unmodifiableSet(new HashSet<UriRef>(textAnnotationTypes));
+        }
+    }
+    
+    public final UriRef getTextAnnotation() {
+        return textAnnotation;
+    }
+
+    /**
+     * Returns an unmodifiable set containing all the other Text annotations
+     * for the same {@link #getSearchString() search string}.
+     * @return the linked text annotations (read only)
+     */
+    public final Set<UriRef> getLinkedTextAnnotations() {
+        return unmodLinked;
+    }
+    
+    public final boolean addLinked(UriRef textAnnotation){
+        if(textAnnotation != null){
+            return linkedTextAnnotations.add(textAnnotation);
+        } else {
+            return false;
+        }
+    }
+    public final boolean removeLinked(UriRef textAnnotation){
+        return linkedTextAnnotations.remove(textAnnotation);
+    }
+    /**
+     * Getter for the search string used to calculate the suggestions
+     * @return the search string
+     */
+    public final String getSearchString() {
+        return searchString;
+    }
+
+    /**
+     * Getter for the Representations suggested for the 
+     * {@link #getSearchString() search string}
+     * @return the suggestions (read only)
+     */
+    public final List<Representation> getSuggestions() {
+        return suggestions;
+    }
+    
+    @Override
+    public int hashCode() {
+        return searchString.hashCode();
+    }
+    
+    @Override
+    public boolean equals(Object o) {
+        return o instanceof Suggestion && ((Suggestion)o).searchString.equals(searchString);
+    }
+    
+    @Override
+    public int compareTo(Suggestion o) {
+        return searchString.compareTo(o.searchString);
+    }
+
+    /**
+     * Getter for the values of the {@link Properties#DC_TYPE dc:type} property of the
+     * TextAnnotation. These types need to be used
+     * for additional TextAnnotations linked to the one returned by
+     * {@link #getTextAnnotation()}.
+     * @return the {@link Properties#DC_TYPE dc:type} values of the
+     * {@link #getTextAnnotation() TextAnnotation}.
+     */
+    public Set<UriRef> getTextAnnotationTypes() {
+        return textAnnotationTypes;
+    }
+    @Override
+    public String toString() {
+        List<String> suggestedIds = new ArrayList<String>(suggestions.size());
+        for(Representation rep : suggestions){
+            suggestedIds.add(rep == null ? null : rep.getId());
+        }
+        return String.format("Suggestion: %s -> %s",
+            searchString,suggestedIds);
+    }
+}
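Note that Suggestion bases hashCode, equals and compareTo solely on the search string, so two suggestions for the same string collapse in sets and sort alphabetically. A stripped-down stand-in (with UriRef and Representation omitted, and the class name chosen here for illustration) shows that contract in isolation:

```java
import java.util.TreeSet;

/** Simplified stand-in for Suggestion: identity is the search string only. */
public class SuggestionSketch implements Comparable<SuggestionSketch> {
    final String searchString;

    SuggestionSketch(String searchString) {
        if (searchString == null || searchString.isEmpty()) {
            throw new IllegalArgumentException("search string required");
        }
        this.searchString = searchString;
    }

    @Override public int hashCode() { return searchString.hashCode(); }

    @Override public boolean equals(Object o) {
        return o instanceof SuggestionSketch
            && ((SuggestionSketch) o).searchString.equals(searchString);
    }

    @Override public int compareTo(SuggestionSketch o) {
        return searchString.compareTo(o.searchString);
    }

    public static void main(String[] args) {
        TreeSet<SuggestionSketch> set = new TreeSet<>();
        set.add(new SuggestionSketch("Paris"));
        set.add(new SuggestionSketch("Paris")); // same search string: collapses
        set.add(new SuggestionSketch("Athens"));
        System.out.println(set.size());               // prints 2
        System.out.println(set.first().searchString); // prints Athens
    }
}
```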

Propchange: incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/Suggestion.java
------------------------------------------------------------------------------
    svn:mime-type = text/plain

Added: incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/TaxonomyLinkingEngine.java
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/TaxonomyLinkingEngine.java?rev=1141623&view=auto
==============================================================================
--- incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/TaxonomyLinkingEngine.java (added)
+++ incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/TaxonomyLinkingEngine.java Thu Jun 30 17:03:30 2011
@@ -0,0 +1,940 @@
+package org.apache.stanbol.enhancer.engines.taxonomy.impl;
+
+import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.NIE_PLAINTEXTCONTENT;
+
+import java.io.IOException;
+import java.util.ArrayList;
+import java.util.Arrays;
+import java.util.Collection;
+import java.util.Collections;
+import java.util.Dictionary;
+import java.util.HashMap;
+import java.util.HashSet;
+import java.util.Iterator;
+import java.util.List;
+import java.util.Map;
+import java.util.Set;
+import java.util.TreeMap;
+import java.util.TreeSet;
+
+import opennlp.tools.chunker.Chunker;
+import opennlp.tools.chunker.ChunkerME;
+import opennlp.tools.chunker.ChunkerModel;
+import opennlp.tools.postag.POSModel;
+import opennlp.tools.postag.POSTagger;
+import opennlp.tools.postag.POSTaggerME;
+import opennlp.tools.sentdetect.SentenceDetector;
+import opennlp.tools.sentdetect.SentenceDetectorME;
+import opennlp.tools.sentdetect.SentenceModel;
+import opennlp.tools.tokenize.SimpleTokenizer;
+import opennlp.tools.tokenize.Tokenizer;
+import opennlp.tools.tokenize.TokenizerME;
+import opennlp.tools.util.InvalidFormatException;
+import opennlp.tools.util.Span;
+
+import org.apache.clerezza.rdf.core.LiteralFactory;
+import org.apache.clerezza.rdf.core.MGraph;
+import org.apache.clerezza.rdf.core.NonLiteral;
+import org.apache.clerezza.rdf.core.Triple;
+import org.apache.clerezza.rdf.core.UriRef;
+import org.apache.clerezza.rdf.core.impl.PlainLiteralImpl;
+import org.apache.clerezza.rdf.core.impl.TripleImpl;
+import org.apache.commons.io.IOUtils;
+import org.apache.commons.lang.StringUtils;
+import org.apache.felix.scr.annotations.Activate;
+import org.apache.felix.scr.annotations.Component;
+import org.apache.felix.scr.annotations.ConfigurationPolicy;
+import org.apache.felix.scr.annotations.Deactivate;
+import org.apache.felix.scr.annotations.Property;
+import org.apache.stanbol.entityhub.servicesapi.model.Entity;
+import org.apache.stanbol.entityhub.servicesapi.model.Reference;
+import org.apache.stanbol.entityhub.servicesapi.model.Text;
+import org.apache.felix.scr.annotations.ReferenceCardinality;
+import org.apache.felix.scr.annotations.ReferencePolicy;
+import org.apache.felix.scr.annotations.ReferenceStrategy;
+import org.apache.felix.scr.annotations.Service;
+import org.apache.stanbol.commons.opennlp.OpenNLP;
+import org.apache.stanbol.commons.stanboltools.offline.OfflineMode;
+import org.apache.stanbol.enhancer.servicesapi.ContentItem;
+import org.apache.stanbol.enhancer.servicesapi.EngineException;
+import org.apache.stanbol.enhancer.servicesapi.EnhancementEngine;
+import org.apache.stanbol.enhancer.servicesapi.InvalidContentException;
+import org.apache.stanbol.enhancer.servicesapi.ServiceProperties;
+import org.apache.stanbol.enhancer.servicesapi.helper.EnhancementEngineHelper;
+import org.apache.stanbol.enhancer.servicesapi.rdf.OntologicalClasses;
+import org.apache.stanbol.enhancer.servicesapi.rdf.Properties;
+import org.apache.stanbol.entityhub.servicesapi.Entityhub;
+import org.apache.stanbol.entityhub.servicesapi.EntityhubException;
+import org.apache.stanbol.entityhub.servicesapi.defaults.NamespaceEnum;
+import org.apache.stanbol.entityhub.servicesapi.model.Representation;
+import org.apache.stanbol.entityhub.servicesapi.query.FieldQuery;
+import org.apache.stanbol.entityhub.servicesapi.query.QueryResultList;
+import org.apache.stanbol.entityhub.servicesapi.query.TextConstraint;
+import org.apache.stanbol.entityhub.servicesapi.site.ReferencedSite;
+import org.apache.stanbol.entityhub.servicesapi.site.ReferencedSiteException;
+import org.apache.stanbol.entityhub.servicesapi.site.ReferencedSiteManager;
+import org.apache.stanbol.entityhub.servicesapi.util.ModelUtils;
+import org.osgi.service.cm.ConfigurationException;
+import org.osgi.service.component.ComponentContext;
+import org.slf4j.Logger;
+import org.slf4j.LoggerFactory;
+
+@Component(configurationFactory = true, policy = ConfigurationPolicy.REQUIRE, // the baseUri is required!
+    specVersion = "1.1", metatype = true, immediate = true)
+@Service
+public class TaxonomyLinkingEngine implements EnhancementEngine, ServiceProperties {
+
+    private static Logger log = LoggerFactory.getLogger(TaxonomyLinkingEngine.class);
+
+    private static final boolean DEFAULT_SIMPLE_TOKENIZER_STATE = true;
+    private static final int DEFAULT_MIN_SEARCH_TOKEN_LENGTH = 3;
+    private static final boolean DEFAULT_USE_CHUNKER_STATE = false;
+    private static final String DEFAULT_NAME_FIELD = "rdfs:label";
+    /**
+     * The default value for the maximum number of terms suggested for a word
+     */
+    private static final int DEFAULT_SUGGESTIONS = 3;
+    /**
+     * Default value for the number of tokens that must be contained in
+     * suggested terms.
+     */
+    private static final int DEFAULT_MIN_FOUND_TOKENS = 2;
+    @Property
+    public static final String REFERENCED_SITE_ID = "org.apache.stanbol.enhancer.engines.taxonomy.referencedSiteId";
+    @Property(value = DEFAULT_NAME_FIELD)
+    public static final String NAME_FIELD = "org.apache.stanbol.enhancer.engines.taxonomy.nameField";
+    @Property(boolValue=DEFAULT_SIMPLE_TOKENIZER_STATE)
+    public static final String SIMPLE_TOKENIZER = "org.apache.stanbol.enhancer.engines.taxonomy.simpleTokenizer";
+    @Property(intValue=DEFAULT_MIN_SEARCH_TOKEN_LENGTH)
+    public static final String MIN_SEARCH_TOKEN_LENGTH = "org.apache.stanbol.enhancer.engines.taxonomy.minSearchTokenLength";
+    @Property(boolValue=DEFAULT_USE_CHUNKER_STATE)
+    public static final String ENABLE_CHUNKER = "org.apache.stanbol.enhancer.engines.taxonomy.enableChunker";
+    @Property(intValue=DEFAULT_SUGGESTIONS)
+    public static final String MAX_SUGGESTIONS = "org.apache.stanbol.enhancer.engines.taxonomy.maxSuggestions";
+    @Property(intValue=DEFAULT_MIN_FOUND_TOKENS)
+    public static final String MIN_FOUND_TOKENS= "org.apache.stanbol.enhancer.engines.taxonomy.minFoundTokens";
+    
+    protected static final String TEXT_PLAIN_MIMETYPE = "text/plain";
+    /**
+     * The default execution ordering of this Engine. Currently set to
+     * {@link ServiceProperties#ORDERING_EXTRACTION_ENHANCEMENT} + 10. It should run after Metaxa and LangId.
+     */
+    public static final Integer defaultOrder = ServiceProperties.ORDERING_EXTRACTION_ENHANCEMENT + 10;
+
+    public static final Map<String,UriRef> DEFAULT_ENTITY_TYPE_MAPPINGS;
+    static { //the default mappings for the three types used by the Stanbol Enhancement Structure
+        Map<String,UriRef> mappings = new HashMap<String,UriRef>();
+        mappings.put(OntologicalClasses.DBPEDIA_ORGANISATION.getUnicodeString(), OntologicalClasses.DBPEDIA_ORGANISATION);
+        mappings.put(NamespaceEnum.dbpediaOnt+"Newspaper", OntologicalClasses.DBPEDIA_ORGANISATION);
+        mappings.put(NamespaceEnum.schema+"Organization", OntologicalClasses.DBPEDIA_ORGANISATION);
+        
+        mappings.put(OntologicalClasses.DBPEDIA_PERSON.getUnicodeString(), OntologicalClasses.DBPEDIA_PERSON);
+        mappings.put(NamespaceEnum.foaf+"Person", OntologicalClasses.DBPEDIA_PERSON);
+        mappings.put(NamespaceEnum.schema+"Person", OntologicalClasses.DBPEDIA_PERSON);
+
+        mappings.put(OntologicalClasses.DBPEDIA_PLACE.getUnicodeString(), OntologicalClasses.DBPEDIA_PLACE);
+        mappings.put(NamespaceEnum.schema+"Place", OntologicalClasses.DBPEDIA_PLACE);
+
+        mappings.put(OntologicalClasses.SKOS_CONCEPT.getUnicodeString(), OntologicalClasses.SKOS_CONCEPT);
+        DEFAULT_ENTITY_TYPE_MAPPINGS = Collections.unmodifiableMap(mappings);
+    }
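The static mapping table above reduces arbitrary rdf:type values of suggested entities to the dc:type classes used by the Stanbol Enhancement Structure. A minimal standalone sketch of that lookup follows; the URIs and class names in it are illustrative examples, not the engine's actual constants:

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class TypeMappingSketch {

    // illustrative subset of the mapping table; URIs written out literally
    static final Map<String,String> MAPPINGS = new HashMap<String,String>();
    static {
        MAPPINGS.put("http://schema.org/Person", "http://dbpedia.org/ontology/Person");
        MAPPINGS.put("http://xmlns.com/foaf/0.1/Person", "http://dbpedia.org/ontology/Person");
        MAPPINGS.put("http://schema.org/Place", "http://dbpedia.org/ontology/Place");
    }

    /** Collects the dc:type values for the rdf:type values of a suggestion. */
    static Set<String> dcTypes(List<String> rdfTypes) {
        Set<String> result = new HashSet<String>();
        for (String type : rdfTypes) {
            String mapped = MAPPINGS.get(type); // unmapped types yield null
            if (mapped != null) {
                result.add(mapped);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(dcTypes(Arrays.asList(
            "http://xmlns.com/foaf/0.1/Person",
            "http://example.org/UnmappedType")));
        // prints "[http://dbpedia.org/ontology/Person]"
    }
}
```

Unmapped types are silently dropped, which is why the enhance() loop below can collect types from several suggestions into one set without further filtering.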
+    
+    @org.apache.felix.scr.annotations.Reference
+    private OpenNLP openNLP;
+
+    /**
+     * Allows forcing the use of the {@link SimpleTokenizer}
+     */
+    private boolean useSimpleTokenizer = DEFAULT_SIMPLE_TOKENIZER_STATE;
+
+    /**
+     * The minimum length of tokens that are looked up in the dictionary
+     */
+    private int minSearchTokenLength = DEFAULT_MIN_SEARCH_TOKEN_LENGTH;
+
+    /**
+     * Allows activating/deactivating the use of a {@link Chunker}
+     */
+    private boolean useChunker = DEFAULT_USE_CHUNKER_STATE;
+
+    /**
+     * The field used to search for the names of entities that are part of the dictionary
+     */
+    private String nameField = NamespaceEnum.getFullName(DEFAULT_NAME_FIELD);
+    /**
+     * The maximum number of terms suggested for a word
+     */
+    private int maxSuggestions = DEFAULT_SUGGESTIONS;
+    /**
+     * If several words are selected from the text to search for an Entity in the
+     * Dictionary (e.g. if a {@link Chunker} is used or if the {@link POSTagger}
+     * detects several connected nouns), the entities found for such chunks
+     * MUST define a label (with no language or the correct language) that contains
+     * at least this number of tokens to be accepted.<p>
+     * TODO: make configurable
+     */
+    private int minFoundTokens = DEFAULT_MIN_FOUND_TOKENS;
+    
+    /**
+     * Service of the Entityhub that manages all active Referenced Sites. This service is used to look up
+     * the configured Referenced Site when we need to enhance a content item.
+     */
+    @org.apache.felix.scr.annotations.Reference
+    protected ReferencedSiteManager siteManager;
+
+    /**
+     * Used to lookup Entities if the {@link #REFERENCED_SITE_ID} property is
+     * set to "entityhub" or "local"
+     */
+    @org.apache.felix.scr.annotations.Reference
+    protected Entityhub entityhub;
+    
+    /**
+     * This holds the id of the {@link ReferencedSite} used to lookup Entities
+     * or <code>null</code> if the {@link Entityhub} is used. 
+     */
+    protected String referencedSiteID;
+
+    /**
+     * Default constructor used by OSGI
+     */
+    public TaxonomyLinkingEngine(){}
+    
+    /**
+     * The RDF LiteralFactory used for typed literals
+     */
+    private LiteralFactory literalFactory = LiteralFactory.getInstance();
+    
+    /**
+     * Constructor used for unit tests outside of an OSGI environment
+     * @param openNLP
+     */
+    protected TaxonomyLinkingEngine(OpenNLP openNLP){
+        if(openNLP == null){
+            throw new IllegalArgumentException("The passed OpenNLP instance MUST NOT be NULL");
+        }
+        this.openNLP = openNLP;
+    }
+    @SuppressWarnings("unchecked")
+    @Activate
+    protected void activate(ComponentContext context) throws ConfigurationException {
+        Dictionary<String,Object> config = context.getProperties();
+        //lookup the referenced site used as dictionary
+        Object referencedSiteID = config.get(REFERENCED_SITE_ID);
+        if (referencedSiteID == null) {
+            throw new ConfigurationException(REFERENCED_SITE_ID,
+                    "The ID of the Referenced Site is a required Parameter and MUST NOT be NULL!");
+        }
+
+        this.referencedSiteID = referencedSiteID.toString();
+        if (this.referencedSiteID.isEmpty()) {
+            throw new ConfigurationException(REFERENCED_SITE_ID,
+                    "The ID of the Referenced Site is a required Parameter and MUST NOT be an empty String!");
+        }
+        if(Entityhub.ENTITYHUB_IDS.contains(this.referencedSiteID.toLowerCase())){
+            log.info("Init TaxonomyLinkingEngine instance for the Entityhub");
+            this.referencedSiteID = null;
+        }
+        //parse the other configurations
+        Object value = config.get(ENABLE_CHUNKER);
+        if(value instanceof Boolean){
+            useChunker = ((Boolean)value).booleanValue();
+        } else if(value != null){
+            useChunker = Boolean.parseBoolean(value.toString());
+        }
+        value = config.get(MIN_SEARCH_TOKEN_LENGTH);
+        if(value instanceof Number){
+            minSearchTokenLength = ((Number)value).intValue();
+        } else if(value != null){
+            try {
+                minSearchTokenLength = Integer.parseInt(value.toString());
+            }catch (NumberFormatException e) {
+                log.warn("Unable to parse the value for the minimum search token length. " +
+                    "Using the default value "+minSearchTokenLength,e);
+            }
+        }
+        value = config.get(SIMPLE_TOKENIZER);
+        if(value instanceof Boolean){
+            useSimpleTokenizer = ((Boolean)value).booleanValue();
+        } else if(value != null){
+            useSimpleTokenizer = Boolean.parseBoolean(value.toString());
+        }
+        value = config.get(NAME_FIELD);
+        if(value != null && !value.toString().isEmpty()){
+            this.nameField = NamespaceEnum.getFullName(value.toString());
+        }
+        value = config.get(MAX_SUGGESTIONS);
+        if(value instanceof Number){
+            maxSuggestions = ((Number)value).intValue();
+        } else if(value != null){
+            try {
+                maxSuggestions = Integer.parseInt(value.toString());
+            } catch (NumberFormatException e) {
+                log.warn("Unable to parse the value for the maximum number of suggestions. " +
+                    "Using the default value "+maxSuggestions,e);
+            }
+        }
+
+    }
+    @Deactivate
+    protected void deactivate(ComponentContext context) {
+        referencedSiteID = null;
+        //reset optional properties to the default
+        nameField = DEFAULT_NAME_FIELD;
+        useChunker = DEFAULT_USE_CHUNKER_STATE;
+        minSearchTokenLength = DEFAULT_MIN_SEARCH_TOKEN_LENGTH;
+        useSimpleTokenizer = DEFAULT_SIMPLE_TOKENIZER_STATE;
+        minFoundTokens = DEFAULT_MIN_FOUND_TOKENS;
+        maxSuggestions = DEFAULT_SUGGESTIONS;
+    }
+    
+    /**
+     * The {@link OfflineMode} is used by Stanbol to indicate that no external services should be accessed.
+     * For this engine that means it is necessary to check if the used {@link ReferencedSite} can operate
+     * offline or not.
+     * 
+     * @see #enableOfflineMode(OfflineMode)
+     * @see #disableOfflineMode(OfflineMode)
+     */
+    @org.apache.felix.scr.annotations.Reference(
+        cardinality = ReferenceCardinality.OPTIONAL_UNARY, 
+        policy = ReferencePolicy.DYNAMIC, 
+        bind = "enableOfflineMode", 
+        unbind = "disableOfflineMode", 
+        strategy = ReferenceStrategy.EVENT)
+    private OfflineMode offlineMode;
+
+    private RedirectProcessingState redirectState = RedirectProcessingState.FOLLOW;
+
+    /**
+     * Called by the SCR runtime to bind the {@link #offlineMode} if the service becomes available
+     * 
+     * @param mode
+     */
+    protected final void enableOfflineMode(OfflineMode mode) {
+        this.offlineMode = mode;
+    }
+
+    /**
+     * Called by the SCR runtime to unbind the {@link #offlineMode} if the service becomes unavailable
+     * 
+     * @param mode
+     */
+    protected final void disableOfflineMode(OfflineMode mode) {
+        this.offlineMode = null;
+    }
+
+    /**
+     * Returns <code>true</code> only if Stanbol operates in {@link OfflineMode}.
+     * 
+     * @return the offline state
+     */
+    protected final boolean isOfflineMode() {
+        return offlineMode != null;
+    }
+    
+    @Override
+    public int canEnhance(ContentItem ci) throws EngineException {
+        String mimeType = ci.getMimeType().split(";", 2)[0];
+        if (TEXT_PLAIN_MIMETYPE.equalsIgnoreCase(mimeType)) {
+            return ENHANCE_SYNCHRONOUS;
+        }
+        // check for existence of textual content in metadata
+        UriRef subj = new UriRef(ci.getId());
+        Iterator<Triple> it = ci.getMetadata().filter(subj, NIE_PLAINTEXTCONTENT, null);
+        if (it.hasNext()) {
+            return ENHANCE_SYNCHRONOUS;
+        }
+        return CANNOT_ENHANCE;
+    }
+
+    @Override
+    public void computeEnhancements(ContentItem ci) throws EngineException {
+        final ReferencedSite site;
+        if(referencedSiteID != null) { //lookup the referenced site
+            site = siteManager.getReferencedSite(referencedSiteID);
+            //ensure that it is present
+            if (site == null) {
+                String msg = String.format(
+                    "Unable to enhance %s because Referenced Site %s is currently not active!", ci.getId(),
+                    referencedSiteID);
+                log.warn(msg);
+                // TODO: throwing Exceptions is currently deactivated. We need a more clear
+                // policy what do to in such situations
+                // throw new EngineException(msg);
+                return;
+            }
+            //and that it supports offline mode if required
+            if (isOfflineMode() && !site.supportsLocalMode()) {
+                log.warn("Unable to enhance ci {} because OfflineMode is not supported by ReferencedSite {}.",
+                    ci.getId(), site.getId());
+                return;
+            }
+        } else { // null indicates to use the Entityhub to lookup Entities
+            site = null;
+        }
+        String mimeType = ci.getMimeType().split(";", 2)[0];
+        String text;
+        if (TEXT_PLAIN_MIMETYPE.equals(mimeType)) {
+            try {
+                text = IOUtils.toString(ci.getStream(),"UTF-8");
+            } catch (IOException e) {
+                throw new InvalidContentException(this, ci, e);
+            }
+        } else {
+            //TODO: change that as soon the Adapter Pattern is used for multiple
+            // mimetype support.
+            StringBuilder textBuilder = new StringBuilder();
+            Iterator<Triple> it = ci.getMetadata().filter(new UriRef(ci.getId()), NIE_PLAINTEXTCONTENT, null);
+            while (it.hasNext()) {
+                textBuilder.append(it.next().getObject());
+            }
+            text = textBuilder.toString();
+        }
+        if (text.trim().length() == 0) {
+            // TODO: make the length of the data a field of the ContentItem
+            // interface to be able to filter out empty items in the canEnhance
+            // method
+            log.warn("nothing to extract knowledge from in ContentItem {}", ci);
+            return;
+        }
+        //TODO: determine the language
+        String language = "en";
+        log.debug("computeEnhancements for ContentItem {} language {} text={}", 
+            new Object []{ci.getId(), language, StringUtils.abbreviate(text, 100)});
+        
+        //first get the models
+        Tokenizer tokenizer = initTokenizer(language);
+        SentenceDetector sentenceDetector = initSentence(language);
+        POSTaggerME posTagger;
+        if(sentenceDetector != null){ //sentence detection is a requirement
+            posTagger = initTagger(language);
+        } else {
+            posTagger = null;
+        }
+        ChunkerME chunker;
+        if(posTagger != null && useChunker){ //POS tags are a requirement
+            chunker = initChunker(language);
+        } else {
+            chunker = null;
+        }
+        Map<String,Suggestion> suggestionCache = new TreeMap<String,Suggestion>();
+        if(sentenceDetector != null){
+            //add dots for multiple line breaks
+            text = text.replaceAll("\\n\\n", ".\n");
+            Span[] sentenceSpans = sentenceDetector.sentPosDetect(text);
+            for (int i = 0; i < sentenceSpans.length; i++) {
+                String sentence = sentenceSpans[i].getCoveredText(text).toString();
+                Span[] tokenSpans = tokenizer.tokenizePos(sentence);
+                String[] tokens = getTokensForSpans(sentence, tokenSpans);
+                String[] pos;
+                double[] posProbs;
+                if(posTagger != null){
+                    pos = posTagger.tag(tokens);
+                    posProbs = posTagger.probs();
+                } else {
+                    pos = null;
+                    posProbs = null;
+                }
+                Span[] chunkSpans;
+                double[] chunkProps;
+                if(chunker != null){
+                    chunkSpans = chunker.chunkAsSpans(tokens, pos);
+                    chunkProps = chunker.probs();
+                } else {
+                    chunkSpans = null;
+                    chunkProps = null;
+                }
+                enhance(suggestionCache,site,ci,language, //the site, metadata and lang
+                    sentenceSpans[i].getStart(),sentence, //offset and sentence
+                    tokenSpans,tokens, //the tokens
+                    pos,posProbs, // the pos tags (might be null)
+                    chunkSpans,chunkProps); //the chunks (might be null)
+            }
+        } else {
+            Span[] tokenSpans = tokenizer.tokenizePos(text);
+            String[] tokens = getTokensForSpans(text, tokenSpans);
+            enhance(suggestionCache,site,ci,language,0,text,tokenSpans,tokens,
+                null,null,null,null);
+        }
+        //finally write the entity enhancements
+        this.wirteEntityEnhancements(suggestionCache, ci, nameField,language);
+    }
+
+    /**
+     * Extracts the covered text for each of the given token {@link Span}s.
+     * @param sentence the sentence the token spans refer to
+     * @param tokenSpans the spans of the tokens within the sentence
+     * @return the tokens as String array
+     */
+    private String[] getTokensForSpans(String sentence, Span[] tokenSpans) {
+        String[] tokens = new String[tokenSpans.length];
+        for(int ti = 0; ti<tokenSpans.length;ti++) {
+            tokens[ti] = tokenSpans[ti].getCoveredText(sentence).toString();
+        }
+        return tokens;
+    }
+
+    private void enhance(Map<String,Suggestion> suggestionCache,
+                         ReferencedSite site,
+                         ContentItem ci,
+                         String language,
+                         int offset,
+                         String sentence,
+                         Span[] tokenSpans,
+                         String[] tokens,
+                         String[] pos,
+                         double[] posProbs,
+                         Span[] chunkSpans,
+                         double[] chunkProps) throws EngineException {
+        //Iterate over tokens. Note that a single iteration may consume multiple
+        //tokens in case a suggestion is found for a chunk.
+        int consumed = -1;
+        int chunkPos = 0;
+        for(int currentToken = 0; currentToken < tokenSpans.length;currentToken++){
+            Span current; //the current chunk to be processed
+            //in case POS tags are available process only tokens with
+            //specific types. If no POS tags are available process all tokens
+            if(pos == null || includePOS(pos[currentToken])){
+                //process this token
+                if(chunkSpans != null && chunkPos < chunkSpans.length){
+                    //consume unused chunks (chunks use token index as start/end)
+                    for(;chunkPos < chunkSpans.length && chunkSpans[chunkPos].getEnd() < currentToken;chunkPos++);
+                    if(chunkPos < chunkSpans.length){
+                        current = chunkSpans[chunkPos]; //use the current chunk
+                        chunkPos++;
+                    } else { //no remaining chunk covers this token
+                        current = new Span(currentToken,currentToken);
+                    }
+                } else if (pos != null){ //if no Chunker is used
+                    //build chunks based on POS tags. For that we have a list
+                    //of tags that are followed (backwards and forwards)
+                    int start = currentToken;
+                    while(start-1 > consumed && followPOS(pos[start-1])){
+                        start--; //follow backwards until consumed
+                    }
+                    int end = currentToken;
+                    while(end+1 < tokens.length && followPOS(pos[end+1])){
+                        end++; //follow forwards until consumed
+                    }
+                    current = new Span(start,end);
+                } else { //if no chunker and no POS tags just use the current token
+                    current = new Span(currentToken,currentToken);
+                }
+            } else { //ignore tokens with POS tags that are not included
+                current = null; 
+            }
+            if(current != null){
+                consumed = currentToken; //set consumed to the current token
+                //calculate the search string and search tokens for the currently
+                //active chunk
+                StringBuilder labelBuilder = null;
+                boolean first = true;
+                int startIndex = current.getStart();
+                int endIndex = current.getEnd();
+                //we need also the tokens to filter results that may be included
+                //because of the use of Tokenizers, Stemmers ...
+                List<String> searchTokens = new ArrayList<String>(current.length()+1);
+                for(int j = current.getStart();j<=current.getEnd();j++){
+                    if((pos == null && tokens[j].length() >= minSearchTokenLength) || 
+                            (pos != null && includePOS(pos[j]))){
+                        if(!first){
+                            labelBuilder.append(' ');
+                            endIndex = j; //update end
+                        } else {
+                            labelBuilder = new StringBuilder();
+                            startIndex = j; //set start
+                            endIndex = j; //set end
+                        }
+                        labelBuilder.append(tokens[j]);
+                        searchTokens.add(tokens[j]);
+                        first = false;
+                    }
+                }
+                String searchString = labelBuilder != null ? labelBuilder.toString() : null;
+                //process new search strings as well as repeated mentions with
+                //suggestions (NULL values in the cache mark blacklisted strings)
+                if(searchString != null && (!suggestionCache.containsKey(searchString) ||
+                        suggestionCache.get(searchString) != null)){
+                    Suggestion suggestion = suggestionCache.get(searchString);
+                    if(suggestion != null){
+                        //create TextAnnotation for this selection and add it to
+                        //the suggestions.
+                        suggestion.addLinked(createTextAnnotation(
+                            offset, sentence, tokenSpans, ci, 
+                            startIndex, endIndex, suggestion.getTextAnnotationTypes()));
+                        log.debug("processed: Entity {} is now mentioned {} times",
+                            searchString,suggestion.getLinkedTextAnnotations().size());
+                    } else { //new word not yet looked up (suggestion == null)
+                        List<Representation> suggestions = searchSuggestions(site, ci, searchTokens, searchString,
+                            language, sentence, tokenSpans, offset, startIndex, endIndex);
+                        if(!suggestions.isEmpty()){
+                            //we need to parse the types to get the dc:type
+                            //values for the TextAnnotations
+                            Set<UriRef> textAnnotationTypes = new HashSet<UriRef>();
+                            for(Representation rep : suggestions){
+                                Iterator<Reference> types = rep.getReferences(NamespaceEnum.rdf+"type");
+                                log.info(" > Entity {}",rep.getId());
+                                while(types.hasNext()){
+                                    Reference type = types.next();
+                                    log.info("  - type {}",type.toString());
+                                    UriRef textAnnotationType = DEFAULT_ENTITY_TYPE_MAPPINGS.get(type.getReference());
+                                    if(textAnnotationType != null){
+                                        textAnnotationTypes.add(textAnnotationType);
+                                    }
+                                }
+                            }
+                            UriRef textAnnotation = createTextAnnotation(
+                                offset, sentence, tokenSpans, ci, startIndex, endIndex,
+                                textAnnotationTypes);
+                            //create a new suggestion
+                            suggestion = new Suggestion(
+                                searchString, textAnnotation, suggestions, 
+                                textAnnotationTypes);
+                            //mark the current selection as "consumed"
+                            consumed = current.getEnd(); 
+                            //also set the current token to the last consumed
+                            //to prevent processing of consumed tokens
+                            currentToken = current.getEnd();
+                            log.debug("processed: First mention of Entity {} ",searchString);
+                        } else {
+                            log.debug("processed: No suggestion for Entity {} ",searchString);
+                            //will add NULL to the suggestion cache and therefore
+                            //blacklist this "searchString"
+                        }
+                        //NULL values are added to blacklist "searchStrings"
+                        suggestionCache.put(searchString, suggestion);
+                    }
+                } else if(log.isDebugEnabled()){ //ignore but do some debugging
+                    if(searchString != null){
+                        log.debug("ignore: {} already processed with no suggestions",searchString);
+                    } else {
+                        log.debug("ignore {}",
+                            new Span(tokenSpans[current.getStart()].getStart(),
+                                tokenSpans[current.getEnd()].getEnd()).getCoveredText(sentence));
+                    }
+                }
+            } else {
+                log.debug("ignore '{}'{}",tokens[currentToken],(pos!=null?'_'+pos[currentToken]:""));
+            }
+        }
+    }
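When no Chunker is configured but POS tags are available, the enhance() loop above builds a chunk by expanding a window around the current noun: backwards until the last consumed token and forwards while the tags qualify. A standalone sketch of that expansion follows; the tag sets are assumed examples, not the engine's actual configuration:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class PosChunkSketch {

    // POS tags that may extend a chunk (nouns + adjectives); an assumed
    // example set, not the tags the engine actually follows
    static final Set<String> FOLLOW = new HashSet<String>(
        Arrays.asList("NN", "NNS", "NNP", "NNPS", "JJ"));

    /**
     * Expands a [start,end] token window around {@code current}, never
     * crossing the already {@code consumed} token index.
     */
    static int[] chunk(String[] pos, int current, int consumed) {
        int start = current;
        while (start - 1 > consumed && FOLLOW.contains(pos[start - 1])) {
            start--; // follow backwards until the last consumed token
        }
        int end = current;
        while (end + 1 < pos.length && FOLLOW.contains(pos[end + 1])) {
            end++; // follow forwards to the end of the phrase
        }
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        // tags for: "the", "big", "red", "car", "drives"
        String[] pos = {"DT", "JJ", "JJ", "NN", "VBZ"};
        // expanding around the noun at index 3 selects "big red car"
        System.out.println(Arrays.toString(chunk(pos, 3, -1))); // prints "[1, 3]"
    }
}
```

The `consumed` bound is what prevents a backwards expansion from re-selecting tokens that an earlier chunk in the same sentence already claimed.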
+    private void wirteEntityEnhancements(Map<String,Suggestion> suggestionsCache,ContentItem ci,String nameField,String language){
+        for(Suggestion suggestion : suggestionsCache.values()){
+            if(suggestion != null){ //this map contains NULL values -> ignore them
+                //create EntityAnnotations for all the suggested Representations
+                Collection<UriRef> related;
+                if(suggestion.getLinkedTextAnnotations().isEmpty()){
+                    related = Collections.singleton((UriRef)suggestion.getTextAnnotation());
+                } else {
+                    related = new ArrayList<UriRef>(suggestion.getLinkedTextAnnotations().size()+1);
+                    related.add(suggestion.getTextAnnotation());
+                    related.addAll(suggestion.getLinkedTextAnnotations());
+                }
+                for(Representation rep : suggestion.getSuggestions()){
+                    EnhancementRDFUtils.writeEntityAnnotation(
+                        this, literalFactory, ci, related,rep, nameField, language);
+                }
+            }
+        }
+    }
+    /**
+     * Searches the {@link ReferencedSite} or the {@link Entityhub} (depending
+     * on the configuration) for Entities corresponding to the search string.
+     * Results are compared against the search tokens (to avoid false positives
+     * caused by tokenization and stemming).
+     * @param site
+     * @param ci
+     * @param searchTokens
+     * @param searchString
+     * @param language
+     * @param sentence
+     * @param tokenSpans
+     * @param offset
+     * @param startIndex
+     * @param endIndex
+     * @return The Entities suggested for the passed searchString. An empty list
+     * indicates that no entities were found
+     * @throws EngineException
+     */
+    private List<Representation> searchSuggestions(ReferencedSite site,
+                                         ContentItem ci,
+                                         List<String> searchTokens,
+                                         String searchString,
+                                         String language,
+                                         String sentence,
+                                         Span[] tokenSpans,
+                                         int offset,
+                                         int startIndex,
+                                         int endIndex) throws EngineException {
+        List<Representation> processedResults;
+        FieldQuery query = site != null ? 
+                site.getQueryFactory().createFieldQuery() :
+                    entityhub.getQueryFactory().createFieldQuery();
+        query.addSelectedField(nameField);
+        query.addSelectedField(NamespaceEnum.rdfs+"comment");
+        query.addSelectedField(NamespaceEnum.rdf+"type");
+        query.addSelectedField(NamespaceEnum.rdfs+"seeAlso");
+        query.setConstraint(nameField, new TextConstraint(searchString));//,language));
+        //select 5 times the number of suggestions to allow some post
+        //filtering
+        //TODO: convert this to additional queries with offset
+        query.setLimit(Integer.valueOf(maxSuggestions*5)); 
+        QueryResultList<Representation> result;
+        try {
+            result = site != null ? site.find(query): entityhub.find(query);
+        } catch (EntityhubException e) {
+            throw new EngineException(this,ci,String.format(
+                "Unable to search for Entity with label '%s@%s'",
+                searchString,language),e);
+        }
+        if(!result.isEmpty()){
+            processedResults = new ArrayList<Representation>(maxSuggestions);
+            for(Iterator<Representation> it = result.iterator();it.hasNext() && processedResults.size()<maxSuggestions;){
+                Representation rep = it.next();
+                if(checkLabels(rep.getText(nameField),language,searchTokens)){
+                    //based on the configuration we might need to do things for
+                    //redirects (rdfs:seeAlso links)
+                    rep = processRedirects(site, rep, query.getSelectedFields());
+                    processedResults.add(rep);
+                } //else ignore this result
+            }
+        } else {
+            processedResults = Collections.emptyList();
+        }
+        return processedResults;
+    }
+    
+    public static enum RedirectProcessingState {
+        IGNORE,ADD_VALUES,FOLLOW
+    }
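The three RedirectProcessingState modes drive the switch in processRedirects: IGNORE keeps the entity unchanged, ADD_VALUES keeps it but merges the redirect target's field values into it, and FOLLOW replaces it with the target. A toy sketch of the difference between FOLLOW and IGNORE; the entity ids and the resolve helper are hypothetical, and ADD_VALUES is deliberately simplified:

```java
import java.util.HashMap;
import java.util.Map;

public class RedirectSketch {

    enum Mode { IGNORE, ADD_VALUES, FOLLOW }

    // toy rdfs:seeAlso links; the ids are hypothetical examples
    static final Map<String,String> SEE_ALSO = new HashMap<String,String>();
    static {
        SEE_ALSO.put("dbpedia:NYC", "dbpedia:New_York_City");
    }

    /**
     * Resolves an entity id according to the redirect mode. ADD_VALUES is
     * simplified here: the real engine keeps the entity and copies the
     * redirect target's field values into it.
     */
    static String resolve(String id, Mode mode) {
        String target = SEE_ALSO.get(id);
        if (mode == Mode.FOLLOW && target != null) {
            return target; // use the redirect target instead of the entity
        }
        return id; // IGNORE and (simplified) ADD_VALUES keep the entity
    }

    public static void main(String[] args) {
        System.out.println(resolve("dbpedia:NYC", Mode.FOLLOW));   // prints "dbpedia:New_York_City"
        System.out.println(resolve("dbpedia:NYC", Mode.IGNORE));   // prints "dbpedia:NYC"
        System.out.println(resolve("dbpedia:Paris", Mode.FOLLOW)); // prints "dbpedia:Paris"
    }
}
```

An id with no seeAlso link resolves to itself in every mode, which mirrors how processRedirects returns the original Representation when no redirect is found.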
+    /**
+     * Processes rdfs:seeAlso references of the passed Representation according to
+     * the configured {@link RedirectProcessingState}.
+     * @param site the ReferencedSite (or <code>null</code> to use the Entityhub)
+     * @param rep the Representation to process
+     * @param fields the selected fields used when adding values of redirected entities
+     * @return the processed Representation
+     */
+    private Representation processRedirects(ReferencedSite site, Representation rep, Collection<String> fields) {
+        Iterator<Reference> redirects = rep.getReferences(NamespaceEnum.rdfs+"seeAlso");
+        switch (redirectState == null ? RedirectProcessingState.IGNORE : redirectState) {
+            case ADD_VALUES:
+                while(redirects.hasNext()){
+                    Reference redirect = redirects.next();
+                    if(redirect != null){
+                        try {
+                            Entity redirectedEntity = site != null ? 
+                                    site.getEntity(redirect.getReference()) : 
+                                        entityhub.getEntity(redirect.getReference());
+                            if(redirectedEntity != null){
+                                for(String field: fields){
+                                    rep.add(field, redirectedEntity.getRepresentation().get(field));
+                                }
+                            }
+                        } catch (EntityhubException e) {
+                            log.info(String.format("Unable to follow redirect to '%s' for Entity '%s'",
+                                redirect.getReference(),rep.getId()),e);
+                        }
+                    }
+                }
+                return rep;
+            case FOLLOW:
+                while(redirects.hasNext()){
+                    Reference redirect = redirects.next();
+                    if(redirect != null){
+                        try {
+                            Entity redirectedEntity = site != null ? 
+                                    site.getEntity(redirect.getReference()) : 
+                                        entityhub.getEntity(redirect.getReference());
+                            if(redirectedEntity != null){
+                                return redirectedEntity.getRepresentation();
+                            }
+                        } catch (EntityhubException e) {
+                            log.info(String.format("Unable to follow redirect to '%s' for Entity '%s'",
+                                redirect.getReference(),rep.getId()),e);
+                        }
+                    }
+                }
+                return rep; //no redirect found
+            default:
+                return rep;
+        }
+
+    }
+    /**
+     * Checks if the labels of an Entity conform to the searchTokens. Because
+     * stemming and tokenizers might be used for indexing the dictionary, this
+     * needs to be done on the client side.
+     * @param labels the labels to check
+     * @param language the language
+     * @param searchTokens the required tokens
+     * @return <code>true</code> if a label was acceptable or <code>false</code>
+     * if no label was found
+     */
+    private boolean checkLabels(Iterator<Text> labels, String language, List<String> searchTokens) {
+        while(labels.hasNext()){
+            Text label = labels.next();
+            //NOTE: startsWith is used for the language so that 'en-GB' labels are accepted for 'en'
+            if(label.getLanguage() == null || label.getLanguage().startsWith(language)){
+                String text = label.getText().toLowerCase();
+                if(searchTokens.size() > 1){
+                    int foundTokens = 0;
+                    for(String token : searchTokens){
+                        if(text.indexOf(token.toLowerCase())>=0){
+                            foundTokens++;
+                        }
+                    }
+                    if(foundTokens == searchTokens.size() || foundTokens >= minFoundTokens ){
+                        return true;
+                    }
+                } else {
+                    //for single searchToken queries there are often results with 
+                    //multiple words. We need to filter those
+                    //e.g. for persons only referenced by the given or family name
+                    if(text.equalsIgnoreCase(searchTokens.get(0))){
+                        return true;
+                    }
+                }
+            }
+        }
+        return false;
+    }
+    /**
+     * Creates a TextAnnotation for the parsed token range.
+     * @param sentenceOffset the char offset of the sentence within the content
+     * @param sentence the sentence text
+     * @param tokenSpans the token spans within the sentence
+     * @param ci the ContentItem the annotation is added to
+     * @param startTokenIndex the index of the first token of the selection
+     * @param endTokenIndex the index of the last token of the selection
+     * @param dcTypes the dc:type values to add to the annotation
+     * @return the URI of the created TextAnnotation
+     */
+    private UriRef createTextAnnotation(int sentenceOffset,
+                                        String sentence,
+                                        Span[] tokenSpans,
+                                        ContentItem ci,
+                                        int startTokenIndex,
+                                        int endTokenIndex,
+                                        Set<UriRef> dcTypes) {
+        MGraph metadata = ci.getMetadata();
+        UriRef contentItemId = new UriRef(ci.getId());
+        UriRef textAnnotation = EnhancementEngineHelper.createTextEnhancement(metadata, this, contentItemId);
+        int startChar =  tokenSpans[startTokenIndex].getStart();
+        int endChar = tokenSpans[endTokenIndex].getEnd();
+        metadata.add(new TripleImpl(
+            textAnnotation,
+            Properties.ENHANCER_START,
+            literalFactory.createTypedLiteral(sentenceOffset+startChar)));
+        metadata.add(new TripleImpl(
+            textAnnotation,
+            Properties.ENHANCER_END,
+            literalFactory.createTypedLiteral(sentenceOffset+endChar)));
+        metadata.add(new TripleImpl(
+            textAnnotation,
+            Properties.ENHANCER_SELECTED_TEXT,
+            new PlainLiteralImpl(new Span(startChar, endChar).getCoveredText(sentence).toString())));
+        metadata.add(new TripleImpl(
+            textAnnotation,
+            Properties.ENHANCER_SELECTION_CONTEXT,
+            new PlainLiteralImpl(sentence)));
+        for(UriRef type : dcTypes){
+            metadata.add(new TripleImpl(textAnnotation, Properties.DC_TYPE, type));
+        }
+        return textAnnotation;
+    }
+
+    /**
+     * Set of POS tags used to build chunks if no {@link Chunker} is used.
+     * NOTE that all tags starting with 'N' (Nouns) are included anyway
+     */
+    public static final Set<String> followPosSet = Collections.unmodifiableSet(
+        new TreeSet<String>(Arrays.asList(
+            "#","$"," ","(",")",",",".",":","``","POS","CD","IN","FW")));//,"''")));
+    /**
+     * Set of POS tags used for searches.
+     * NOTE that all tags starting with 'N' (Nouns) are included anyway
+     */
+    public static final Set<String> searchPosSet = Collections.unmodifiableSet(
+        new TreeSet<String>(Arrays.asList(
+            "FW")));//,"''")));
+    /**
+     * TODO: This might be language specific!
+     * @param pos the POS tag
+     * @return if tokens with this tag should be included when building chunks
+     */
+    private boolean followPOS(String pos){
+        return pos.charAt(0) == 'N' || followPosSet.contains(pos);
+    }
+    private boolean includePOS(String pos){
+        return pos.charAt(0) == 'N' || searchPosSet.contains(pos);
+    }
+    
+    /**
+     * Initialises the {@link Tokenizer} for the parsed language.
+     * @param language the language
+     * @return the tokenizer
+     */
+    private Tokenizer initTokenizer(String language) {
+        Tokenizer tokenizer;
+        if(useSimpleTokenizer ){
+            tokenizer = SimpleTokenizer.INSTANCE;
+        } else {
+            tokenizer = openNLP.getTokenizer(language);
+        }
+        return tokenizer;
+    }
+
+    /**
+     * Initialises the POS tagger for the parsed language.
+     * @param language the language
+     * @return the tagger or <code>null</code> if no model is available
+     */
+    private POSTaggerME initTagger(String language) {
+        POSTaggerME posTagger;
+        try {
+            POSModel posModel = openNLP.getPartOfSpeachModel(language);
+            if(posModel != null){
+                posTagger = new POSTaggerME(posModel);
+            } else {
+                log.debug("No POS Model for language {}",language);
+                posTagger = null;
+            }
+        } catch (IOException e) {
+            log.info("Unable to load POS Model for language "+language,e);
+            posTagger = null;
+        }
+        return posTagger;
+    }
+    /**
+     * Initialises the {@link SentenceDetector} for the parsed language.
+     * @param language the language
+     * @return the sentence detector or <code>null</code> if no model is available
+     */
+    private SentenceDetector initSentence(String language) {
+        SentenceDetector sentDetect;
+        try {
+            SentenceModel sentModel = openNLP.getSentenceModel(language);
+            if(sentModel != null){
+                sentDetect = new SentenceDetectorME(sentModel);
+            } else {
+                log.debug("No Sentence Detection Model for language {}",language);
+                sentDetect = null;
+            }
+        } catch (IOException e) {
+            log.info("Unable to load Sentence Detection Model for language "+language,e);
+            sentDetect = null;
+        }
+        return sentDetect;
+    }
+    /**
+     * Initialises the {@link ChunkerME} for the parsed language.
+     * @param language the language
+     * @return the chunker or <code>null</code> if no model is available
+     */
+    private ChunkerME initChunker(String language) {
+        ChunkerME chunker;
+        try {
+            ChunkerModel chunkerModel = openNLP.getChunkerModel(language);
+            if(chunkerModel != null){
+                chunker = new ChunkerME(chunkerModel);
+            } else {
+                log.debug("No Chunker Model for language {}",language);
+                chunker = null;
+            }
+        } catch (IOException e) {
+            log.info("Unable to load Chunker Model for language "+language,e);
+            chunker = null;
+        }
+        return chunker;
+    }
+        
+
+    @Override
+    public Map<String,Object> getServiceProperties() {
+        return Collections.unmodifiableMap(Collections.singletonMap(
+            ENHANCEMENT_ENGINE_ORDERING,
+            (Object) defaultOrder));
+    }
+
+}
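The label post-filtering performed by checkLabels(...) above can be sketched in isolation. The class name, method name, and minFoundTokens parameter below are illustrative, not part of the commit: multi-token searches accept a label when all tokens (or at least a configured minimum) occur as case-insensitive substrings, while single-token searches require the label to match the whole token exactly, filtering out e.g. persons matched only by a given or family name.

```java
import java.util.Arrays;
import java.util.List;

public class LabelCheckSketch {

    /**
     * Simplified version of the engine's label check.
     * @param label a label of the suggested entity
     * @param searchTokens the tokens of the search string
     * @param minFoundTokens minimum number of matching tokens (engine config)
     */
    static boolean matches(String label, List<String> searchTokens, int minFoundTokens) {
        String text = label.toLowerCase();
        if (searchTokens.size() > 1) {
            int found = 0;
            for (String token : searchTokens) {
                // substring match, case-insensitive
                if (text.indexOf(token.toLowerCase()) >= 0) {
                    found++;
                }
            }
            // accept if all tokens match, or at least the configured minimum
            return found == searchTokens.size() || found >= minFoundTokens;
        } else {
            // single-token queries must match the whole label to filter out
            // multi-word labels matched by only one word
            return label.equalsIgnoreCase(searchTokens.get(0));
        }
    }

    public static void main(String[] args) {
        System.out.println(matches("Barack Obama", Arrays.asList("Barack", "Obama"), 2));
        System.out.println(matches("Barack Obama", Arrays.asList("Obama"), 1));
        System.out.println(matches("Obama", Arrays.asList("Obama"), 1));
    }
}
```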

Propchange: incubator/stanbol/trunk/enhancer/engines/taxonomylinking/src/main/java/org/apache/stanbol/enhancer/engines/taxonomy/impl/TaxonomyLinkingEngine.java
------------------------------------------------------------------------------
    svn:mime-type = text/plain

Modified: incubator/stanbol/trunk/enhancer/generic/servicesapi/src/main/java/org/apache/stanbol/enhancer/servicesapi/rdf/OntologicalClasses.java
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/generic/servicesapi/src/main/java/org/apache/stanbol/enhancer/servicesapi/rdf/OntologicalClasses.java?rev=1141623&r1=1141622&r2=1141623&view=diff
==============================================================================
--- incubator/stanbol/trunk/enhancer/generic/servicesapi/src/main/java/org/apache/stanbol/enhancer/servicesapi/rdf/OntologicalClasses.java (original)
+++ incubator/stanbol/trunk/enhancer/generic/servicesapi/src/main/java/org/apache/stanbol/enhancer/servicesapi/rdf/OntologicalClasses.java Thu Jun 30 17:03:30 2011
@@ -23,6 +23,9 @@ public class OntologicalClasses {
     public static final UriRef DBPEDIA_ORGANISATION = new UriRef(
             NamespaceEnum.dbpedia_ont+"Organisation");
 
+    public static final UriRef SKOS_CONCEPT = new UriRef(
+        NamespaceEnum.skos+"Concept");
+
     private OntologicalClasses() {
     }
 

Modified: incubator/stanbol/trunk/enhancer/jersey/src/main/java/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource.java
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/jersey/src/main/java/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource.java?rev=1141623&r1=1141622&r2=1141623&view=diff
==============================================================================
--- incubator/stanbol/trunk/enhancer/jersey/src/main/java/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource.java (original)
+++ incubator/stanbol/trunk/enhancer/jersey/src/main/java/org/apache/stanbol/enhancer/jersey/resource/ContentItemResource.java Thu Jun 30 17:03:30 2011
@@ -4,6 +4,7 @@ import static javax.ws.rs.core.MediaType
 import static org.apache.stanbol.enhancer.servicesapi.rdf.OntologicalClasses.DBPEDIA_ORGANISATION;
 import static org.apache.stanbol.enhancer.servicesapi.rdf.OntologicalClasses.DBPEDIA_PERSON;
 import static org.apache.stanbol.enhancer.servicesapi.rdf.OntologicalClasses.DBPEDIA_PLACE;
+import static org.apache.stanbol.enhancer.servicesapi.rdf.OntologicalClasses.SKOS_CONCEPT;
 import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.GEO_LAT;
 import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.GEO_LONG;
 import static org.apache.stanbol.enhancer.servicesapi.rdf.Properties.NIE_PLAINTEXTCONTENT;
@@ -97,6 +98,10 @@ public class ContentItemResource extends
 
     protected Collection<EntityExtractionSummary> places;
 
+    protected Collection<EntityExtractionSummary> concepts;
+    
+    protected Collection<EntityExtractionSummary> others;
+
     public ContentItemResource(String localId,
                                ContentItem ci,
                                TripleCollection remoteEntityCache,
@@ -131,7 +136,8 @@ public class ContentItemResource extends
         defaultThumbnails.put(DBPEDIA_PERSON, getStaticRootUrl() + "/home/images/user_48.png");
         defaultThumbnails.put(DBPEDIA_ORGANISATION, getStaticRootUrl() + "/home/images/organization_48.png");
         defaultThumbnails.put(DBPEDIA_PLACE, getStaticRootUrl() + "/home/images/compass_48.png");
-
+        defaultThumbnails.put(SKOS_CONCEPT, getStaticRootUrl() + "/home/images/black_gear_48.png");
+        defaultThumbnails.put(null, getStaticRootUrl() + "/home/images/unknown_48.png");
     }
 
     public String getRdfMetadata(String mediatype) throws UnsupportedEncodingException {
@@ -174,6 +180,12 @@ public class ContentItemResource extends
         }
         return people;
     }
+    public Collection<EntityExtractionSummary> getOtherOccurrences() throws ParseException {
+        if(others == null){
+            others = getOccurrences(null);
+        }
+        return others;
+    }
 
     public Collection<EntityExtractionSummary> getOrganizationOccurrences() throws ParseException {
         if (organizations == null) {
@@ -188,25 +200,38 @@ public class ContentItemResource extends
         }
         return places;
     }
+    public Collection<EntityExtractionSummary> getConceptOccurrences() throws ParseException {
+        if (concepts == null) {
+            concepts = getOccurrences(SKOS_CONCEPT);
+        }
+        return concepts;
+    }
 
     public Collection<EntityExtractionSummary> getOccurrences(UriRef type) throws ParseException {
         MGraph graph = contentItem.getMetadata();
-        String q = "PREFIX enhancer: <http://fise.iks-project.eu/ontology/> "
-                   + "PREFIX dc:   <http://purl.org/dc/terms/> "
-                   + "SELECT ?textAnnotation ?text ?entity ?entity_label ?confidence WHERE { "
-                   + "  ?textAnnotation a enhancer:TextAnnotation ." 
-                   + "  ?textAnnotation dc:type %s ."
-                   + "  ?textAnnotation enhancer:selected-text ?text ." 
-                   + " OPTIONAL {"
-                   + "   ?entityAnnotation dc:relation ?textAnnotation ."
-                   + "   ?entityAnnotation a enhancer:EntityAnnotation . "
-                   + "   ?entityAnnotation enhancer:entity-reference ?entity ."
-                   + "   ?entityAnnotation enhancer:entity-label ?entity_label ."
-                   + "   ?entityAnnotation enhancer:confidence ?confidence . }" 
-                   + "} ORDER BY ?text ";
-        q = String.format(q, type);
+        StringBuilder queryBuilder = new StringBuilder();
+        queryBuilder.append("PREFIX enhancer: <http://fise.iks-project.eu/ontology/> ");
+        queryBuilder.append("PREFIX dc:   <http://purl.org/dc/terms/> ");
+        queryBuilder.append("SELECT ?textAnnotation ?text ?entity ?entity_label ?confidence WHERE { ");
+        queryBuilder.append("  ?textAnnotation a enhancer:TextAnnotation ." );
+        if(type != null){
+            queryBuilder.append("  ?textAnnotation dc:type ").append(type).append(" . ");
+        } else {
+            //append a filter requiring that no dc:type value exists
+            queryBuilder.append(" OPTIONAL { ?textAnnotation dc:type ?type } . ");
+            queryBuilder.append(" FILTER(!bound(?type)) ");
+        }
+        queryBuilder.append("  ?textAnnotation enhancer:selected-text ?text ." );
+        queryBuilder.append(" OPTIONAL {");
+        queryBuilder.append("   ?entityAnnotation dc:relation ?textAnnotation .");
+        queryBuilder.append("   ?entityAnnotation a enhancer:EntityAnnotation . ");
+        queryBuilder.append("   ?entityAnnotation enhancer:entity-reference ?entity .");
+        queryBuilder.append("   ?entityAnnotation enhancer:entity-label ?entity_label .");
+        queryBuilder.append("   ?entityAnnotation enhancer:confidence ?confidence . }" );
+        queryBuilder.append("} ORDER BY ?text ");
+//        String queryString = String.format(queryBuilder.toString(), type);
 
-        SelectQuery query = (SelectQuery) QueryParser.getInstance().parse(q);
+        SelectQuery query = (SelectQuery) QueryParser.getInstance().parse(queryBuilder.toString());
         ResultSet result = tcManager.executeSparqlQuery(query, graph);
         Map<String,EntityExtractionSummary> occurrenceMap = new TreeMap<String,EntityExtractionSummary>();
         LiteralFactory lf = LiteralFactory.getInstance();
@@ -221,8 +246,8 @@ public class ContentItemResource extends
             // TODO: collect the selected text and contexts of subsumed
             // annotations
 
-            TypedLiteral textLiteral = (TypedLiteral) mapping.get("text");
-            String text = lf.createObject(String.class, textLiteral);
+            Literal textLiteral = (Literal) mapping.get("text");
+            String text = textLiteral.getLexicalForm();
 
             EntityExtractionSummary entity = occurrenceMap.get(text);
             if (entity == null) {

Modified: incubator/stanbol/trunk/enhancer/jersey/src/main/resources/org/apache/stanbol/enhancer/jersey/templates/imports/contentitem.ftl
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/jersey/src/main/resources/org/apache/stanbol/enhancer/jersey/templates/imports/contentitem.ftl?rev=1141623&r1=1141622&r2=1141623&view=diff
==============================================================================
--- incubator/stanbol/trunk/enhancer/jersey/src/main/resources/org/apache/stanbol/enhancer/jersey/templates/imports/contentitem.ftl (original)
+++ incubator/stanbol/trunk/enhancer/jersey/src/main/resources/org/apache/stanbol/enhancer/jersey/templates/imports/contentitem.ftl Thu Jun 30 17:03:30 2011
@@ -26,6 +26,21 @@
 <@entities.listing entities=it.placeOccurrences /> 
 </#if>
 </div>
+
+<div class="entitylisting">
+<#if it.conceptOccurrences?size != 0>
+<h3>Concepts</h3>
+<@entities.listing entities=it.conceptOccurrences /> 
+</#if>
+</div>
+
+<div class="entitylisting">
+<#if it.otherOccurrences?size != 0>
+<h3>Others</h3>
+<@entities.listing entities=it.otherOccurrences /> 
+</#if>
+</div>
+
 </div>
 <div style="clear: both"></div>
 

Modified: incubator/stanbol/trunk/enhancer/pom.xml
URL: http://svn.apache.org/viewvc/incubator/stanbol/trunk/enhancer/pom.xml?rev=1141623&r1=1141622&r2=1141623&view=diff
==============================================================================
--- incubator/stanbol/trunk/enhancer/pom.xml (original)
+++ incubator/stanbol/trunk/enhancer/pom.xml Thu Jun 30 17:03:30 2011
@@ -55,6 +55,7 @@
     <module>engines/metaxa</module>
     <module>engines/geonames</module>
     <module>engines/entitytagging</module>
+    <module>engines/taxonomylinking</module>
     <!-- RICK based enhancement engine(s) -->
     <module>engines/opencalais</module>
     <module>engines/zemanta</module>