Posted to commits@stanbol.apache.org by rw...@apache.org on 2011/09/23 13:08:13 UTC

svn commit: r1174655 - in /incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer: EnhancementStructureOverview.png stanbolenhancementstructure.mdtext

Author: rwesten
Date: Fri Sep 23 11:08:13 2011
New Revision: 1174655

URL: http://svn.apache.org/viewvc?rev=1174655&view=rev
Log:
Some updates to the Stanbol Enhancement Structure.

I am currently in the process of updating this document based on the comments in the discussion thread on the stanbol-dev list [1].
This is ongoing work that will need some more time (and discussions). In the meantime I would like to keep this on the staging server.


[1] http://markmail.org/message/upzgzn5ew7cqa6ou

Added:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/EnhancementStructureOverview.png   (with props)
Modified:
    incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/stanbolenhancementstructure.mdtext

Added: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/EnhancementStructureOverview.png
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/EnhancementStructureOverview.png?rev=1174655&view=auto
==============================================================================
Binary file - no diff available.

Propchange: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/EnhancementStructureOverview.png
------------------------------------------------------------------------------
    svn:mime-type = application/octet-stream

Modified: incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/stanbolenhancementstructure.mdtext
URL: http://svn.apache.org/viewvc/incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/stanbolenhancementstructure.mdtext?rev=1174655&r1=1174654&r2=1174655&view=diff
==============================================================================
--- incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/stanbolenhancementstructure.mdtext (original)
+++ incubator/stanbol/site/trunk/content/stanbol/docs/trunk/enhancer/stanbolenhancementstructure.mdtext Fri Sep 23 11:08:13 2011
@@ -8,28 +8,33 @@ This describe the schema (ontology) used
 
 The Stanbol Enhancement Structure is built around the following main concepts. Each of these concepts covers a specific aspect of the content enhancement process.
 
+
 The following list gives an overview of the concepts used by the Stanbol Enhancement Structure:
 
+![Overview of the Stanbol Enhancement Structure](/EnhancementStructureOverview.png "Overview of the Stanbol Enhancement Structure")
+
 * **ContentItem:** This is the resource representing the parsed content. The URI of this resource depends on how the content was passed to the Stanbol Enhancer. If the request provides an absolute URI, then this URI is used. In all other cases the Stanbol Enhancer creates a URI based on the configured prefix or the URL of the service. The documentation of the RESTful service should provide more information about that.
 
-* **Content:** Several content models distinguish between Content (data) and the ContentItem (interpretation of the data). The Enhancement Structure currently only defines ContentItem, because there is no need to describe the data for the purpose of the enhancement process. Other components (such as the /store endpoint) might need to formally describe the data. For such use cases the sic:content property will be used to refer from the ContentItem to the Content. The URI representing the Content will be the same as the one used to retrieve its data via a RESTful service.
+* **sb:Content:** Several content models distinguish between Content (data) and the ContentItem (interpretation of the data). The Enhancement Structure currently only defines ContentItem, because there is no need to describe the data for the purpose of the enhancement process. Other components (such as the /store endpoint) might need to formally describe the data. For such use cases the sic:content property will be used to refer from the ContentItem to the Content. The URI representing the Content will be the same as the one used to retrieve its data via a RESTful service.
+
+* **sb:Enhancement:** This provides metadata about extractions created by EnhancementEngines or present within the content. This includes the creator (usually an EnhancementEngine), the creation time, as well as relations to other enhancements. Users of the Stanbol Enhancer will typically not care about such data because from their perspective it represents meta-metadata (metadata about the metadata). Every feature, suggestion or other piece of information extracted by any EnhancementEngine needs to attach the metadata defined for this concept.
+
+* **sb:Annotation:** An Annotation describes some piece of knowledge extracted from the parsed content and/or the metadata of the content. Information provided by Annotations includes the label, the type and the confidence. In addition, Annotations need to link to at least one Occurrence and may have one or more Suggestions. Annotations can also be related to or dependent on other Annotations. The EnhancementStructure defines only a small set of Annotation types. Implementors of EnhancementEngines that extract specific kinds of things (e.g. coreferences, events, …) may need to define their own Annotation types. Such extensions should be called "***Annotation" and be defined as rdfs:subClassOf any Annotation type defined by this Enhancement Structure.
 
-* **Enhancement:** This provides metadata about extractions created by EnhancementEngines or present within the content. This includes the creator (usually an EnhancementEngine), the creation time, as well as relations to other enhancements. Users of the Stanbol Enhancer will typically not care about such data because from their perspective it represents meta-metadata (metadata about the metadata).
+* **sb:Suggestion:** A Suggestion describes a Resource (Entity, Topic, Category …) that an EnhancementEngine suggests as a possible match for an Annotation. Suggestions are typically created by Engines that further process Annotations (semantic lifting). However, EnhancementEngines might also create both the Annotation and the Suggestions. Suggestions are always linked to a single Annotation (functional property). They define the label, the ID (typically the URI of the Resource), the type(s) of the suggested Resource and the confidence of the suggestion.
 
-* **Annotation:** An annotation describes a feature present within the parsed content. Such features can have three sources: (1) they can originate from metadata present in the parsed content, (2) they can be extracted by analyzing the content itself and (3) they can be based on further processing Annotations of type (1) and (2). The Annotation provides the label, the type (e.g. Person, Organization, Location), the role (e.g. Tag, Category, Keyword), the confidence and (if available) the link to the entity representing the extracted feature. It is the central concept for users that need to present all the things extracted from the parsed content.
+* **sb:Occurrence:** An Occurrence describes the actual location of an extracted feature within the content. This location may be within the content or within parsed metadata. Occurrences are always linked to a single Annotation (functional property). Based on the type of the content there will be different types of Occurrences. This EnhancementStructure currently focuses on two types of Occurrences: (1) TextOccurrence and (2) MetadataOccurrence. For details on the model of these Occurrence types see the corresponding sections. EnhancementEngines that support the extraction of features from content types that are not covered by this specification (e.g. pictures, sound, video) need to define their own Occurrence types. Such types should use the name "***Occurrence" and be defined as rdfs:subClassOf any of the Occurrence types defined in this specification.
 
-* **Occurrence:** An Occurrence describes the actual location of the feature within the content or the metadata. Based on the type of the content there will be different types of Occurrences. A "text occurrence" will contain information such as the selected text, the start/end position of the selection and the surrounding text to provide some context. An "image occurrence" will provide the top left and the bottom right position of the selected rectangle. A "metadata occurrence" will describe the property used for the annotation (e.g. dc:creator), the standard used (e.g. DCterms) and the value.
+Enhancements encoded based on this specification need to conform to the following rules:
 
-When using the Enhancement Structure one usually needs to combine several of the above concepts to create a meaningful statement.
-As an example take a natural language processing engine that needs to express that the word "Paris" found within a sentence like "I will travel to Paris next week" probably refers to a location.
-To express that it will need to combine the concepts
+* sb:Annotation and sb:Suggestion MUST also be of type sb:Enhancement and include the required metadata defined by sb:Enhancement.
+* sb:Occurrences, sb:Annotations and sb:Suggestions MUST include rdf:type information for all parent types, e.g. when adding an sb:TextOccurrence the rdf:type values MUST include sb:TextOccurrence AND sb:Occurrence. Consumers are expected NOT to use any kind of reasoner, therefore adding such additional information is the only way to ensure that queries for occurrences, annotations or suggestions provide the expected results.
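
The second rule can be illustrated with a minimal Python sketch. It assumes a hypothetical lookup table `PARENT_TYPES` that mirrors only the type relations named in the rules above; the table, the function names and the tuple representation of triples are illustrative and not part of the Stanbol API. A producer could use something like this to write every parent type explicitly, so that a consumer without a reasoner can still query by sb:Occurrence or sb:Enhancement.

```python
# Hypothetical parent-type table; it only covers relations named in the text.
PARENT_TYPES = {
    "sb:TextOccurrence": ["sb:Occurrence"],
    "sb:MetadataOccurrence": ["sb:Occurrence"],
    "sb:Annotation": ["sb:Enhancement"],
    "sb:Suggestion": ["sb:Enhancement"],
}


def expand_types(rdf_type):
    """Return rdf_type followed by all of its ancestors, without duplicates."""
    result = []
    pending = [rdf_type]
    while pending:
        current = pending.pop(0)
        if current not in result:
            result.append(current)
            pending.extend(PARENT_TYPES.get(current, []))
    return result


def type_triples(subject, rdf_type):
    """Build one (subject, rdf:type, type) triple per type in the hierarchy."""
    return [(subject, "rdf:type", t) for t in expand_types(rdf_type)]


if __name__ == "__main__":
    for triple in type_triples("<o1>", "sb:TextOccurrence"):
        print(*triple)
```

Materializing the parent types at write time keeps consumer queries simple, at the cost of a few redundant triples per resource.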
 
-* Enhancement: to express that this feature was extracted by the Natural Language Processing Engine at a given time ...
-* Annotation: to express that "Paris" represents a "Location" and has the role "Tag"
-* Occurrence: to express where the selected text "Paris" is located within the analyzed content
+---
 
-The same is true for consuming Enhancements. A client interested in presenting Tags, Categories and Keywords needs only information provided by the Annotation concept. To be able to highlight the actual location of detected features within the content one also needs to process information provided by the Occurrence concept.
+The parts below are currently work in progress.
 
+---
 
 ## Specification
 
@@ -148,6 +153,7 @@ The following properties are defined for
 * **sb:entity**: In case an annotation describes an Entity, this property provides the URI for the entity
 * **sb:entity-type**: In case an annotation describes an Entity, this property provides the rdf:types of the linked entity
 * **sb:suggestion**: Links to another annotation that provides a suggestion for this one. This indicates that the Stanbol Enhancer requests the client to decide between the provided options - e.g. by some user interaction.
+* **sb:occurrence**: Optionally links to one or more sb:Occurrence instances of this annotation within the parsed Content. Note that there are several types of Occurrences (TextOccurrence, ImageOccurrence, MetadataOccurrence …) defined. If this property is missing, then the Annotation is assumed to be about the whole content (as referred to by the sb:extracted-from property).
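
The fallback described for sb:occurrence can be sketched as follows. This is a minimal illustration that assumes annotations are held as plain Python dictionaries keyed by property name; that representation and the function name `annotation_scope` are hypothetical and are not defined by this document.

```python
def annotation_scope(annotation):
    """Return what an annotation is about.

    If the annotation lists occurrences, those are its scope. Otherwise the
    annotation is taken to be about the whole content, identified by the
    value of sb:extracted-from.
    """
    occurrences = annotation.get("sb:occurrence", [])
    if occurrences:
        return list(occurrences)
    return [annotation["sb:extracted-from"]]


if __name__ == "__main__":
    located = {"sb:extracted-from": "<content>", "sb:occurrence": ["<o1>", "<o2>"]}
    global_ = {"sb:extracted-from": "<content>"}
    print(annotation_scope(located))
    print(annotation_scope(global_))
```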
 
 **Annotations Type** describe the type of the annotated feature based on a terminology standardized by Stanbol. Current types include
 
@@ -165,32 +171,147 @@ This list should only contain some types
 * sb:Tag: The feature can be suggested as tag for the parsed content.
 * sb:Category: The feature provides a categorization for the parsed content.
 * sb:Keyword: The feature describes a keyword within the parsed content TODO: describe the difference between keywords and tags
-* sb:Suggestion: The feature is a suggestion for another Annotation.
 
 *NOTE*: Such roles should make it easier to support additional Annotation roles as suggested by [STANBOL-48](https://issues.apache.org/jira/browse/STANBOL-48) and [STANBOL-12](https://issues.apache.org/jira/browse/STANBOL-12), which includes [STANBOL-28](https://issues.apache.org/jira/browse/STANBOL-28) and [STANBOL-29](https://issues.apache.org/jira/browse/STANBOL-29).
 
-For **Suggestions** there are some additional constraints as defined by the following code block
+### sb:Suggestion
 
-    <a> rdf:type sb:Annotation
-    <a> dc:role !sb:Suggestion
-    <a> sb:suggestion <a1>
-        <a1> rdf:type sb:Annotation
-        <a1> dc:role sb:Suggestion
-        <a1> sb:confidence ordering^^xsd:float 
-
-This means:
-
-* an Annotation may only define suggestion if it does not have the dc:role sb:Suggestion. This prohibits nested suggestions
-* an Annotation linked by sb:suggestion is considered to be of the dc:role sb:Suggestion, even if it does not define this role explicitly.
-* Annotations used as suggestions MUST define some way to allow clients to show them in the right order (
-* the confidence value of annotations used as suggestions should be used to order suggestions when presented to the user. However, applications need to consider that such values are on an ordinal scale, meaning that a value of "4" does NOT mean that it is twice as likely as a suggestion with a confidence of "2"!
+Suggestions are used by the Stanbol Enhancer to suggest possible values for the resolution of features extracted from the parsed content.
+Currently two different use cases for Suggestions are defined:
+
+* *(1) Entity Resolution:* Suggests entities for a feature extracted from the content. Typically such suggestions are calculated based on the name of the feature found within the content (e.g. the selected text of a sb:TextOccurrence).
+* *(2) Field Value Suggestion:* Suggests a value for a specific property. This kind of suggestion is useful if a relation between two extracted features is detected. A typical example would be a person "Steve Jobs" with the role "CEO" of the company "Apple Inc". Such relations can be detected by NLP tools. However, suggestions like this are also central for semantic lifting of RDFa annotations, as shown in the example below.
+
+sb:Suggestion uses the following properties:
+
+* **sb:entity**: The id of the suggested Entity
+* **sb:entity-type**: The type(s) of the suggested Entity
+* **sb:confidence**: Needed to sort in case of multiple suggestions
+* **sb:field**: Defines the property for which this suggestion should become the value if accepted by the user
+
+In addition, all sb:Suggestions are also of type sb:Enhancement to allow EnhancementEngines to provide enhancement metadata for them.
+
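The sorting role of sb:confidence can be sketched in a few lines of Python. The dictionary layout, the sample values and the function name `order_suggestions` are hypothetical, and the sketch assumes that a higher confidence means a better match. It uses the value only as a ranking key and does not interpret it as a probability.

```python
def order_suggestions(suggestions):
    """Return the suggestions sorted from highest to lowest confidence."""
    return sorted(suggestions, key=lambda s: s["confidence"], reverse=True)


if __name__ == "__main__":
    # Illustrative data only; the identifiers and values are made up.
    suggestions = [
        {"id": "<s1>", "entity": "http://www.example.com/person/A", "confidence": 12.5},
        {"id": "<s2>", "entity": "http://www.example.com/person/B", "confidence": 40.0},
        {"id": "<s3>", "entity": "http://www.example.com/person/C", "confidence": 3.0},
    ]
    for rank, suggestion in enumerate(order_suggestions(suggestions), start=1):
        print(rank, suggestion["id"], suggestion["entity"])
```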
+
+For details on how they are used, please see the following example.
+
+#### Example
+
+As an example, let's assume that the following RDFa-annotated content is passed to the Stanbol Enhancer:
+
+   <span typeof="cal:Vevent">
+       <h3 property="dc:title"> Stanbol Teleconference </h3>
+       <span property="cal:summary">
+           <p> Agenda: </p>
+           <ul>
+               <li> ... </li>
+           </ul>
+           <p> Participants: </p>
+           <ul>
+               <li typeof="foaf:Person" property="foaf:name">Rupert Westenthaler</li>
+               <li typeof="foaf:Person" property="foaf:name">Olivier Grisel</li>
+               <li> ... </li>
+           </ul>
+       </span>
+   </span>
+
+(1) Suggest the Entities for Rupert and Olivier
+(2) Suggest to link Rupert and Olivier as values for "cal:attendee"
+
+For both Rupert Westenthaler and Olivier Grisel an EntityAnnotation would be present, in this case created by the RDFa extractor. In principle this could also work if the RDFa markup is missing; in such cases the EntityAnnotations could be created by an NLPEnhancementEngine.
+
+   <a1> rdf:type sb:EntityAnnotation
+   <a1> dc:title "Rupert Westenthaler"
+   <a1> sb:entity-type foaf:Person
+   <a1> sb:hasOccurrence <o1>
+   <a1> sb:hasSuggestion <s1>
+
+   <a2> rdf:type sb:EntityAnnotation
+   <a2> dc:title "Olivier Grisel"
+   <a2> sb:entity-type foaf:Person
+   <a2> sb:hasOccurrence <o2>
+   <a2> sb:hasSuggestion <s2>
+
+Let's ignore the occurrences (how to create Occurrences for RDFa markup is a whole different story that needs to be specified) and concentrate on the suggestions.
+
+   <s1> rdf:type sb:Suggestion
+   <s1> sb:entity <http://www.example.com/person/Rupert_Westenthaler>
+   <s1> sb:entity-type foaf:Person, vCard:vCard, dbpedia-ont:Person
+   <s1> sb:confidence 123,456
+
+   <s2> rdf:type sb:Suggestion
+   <s2> sb:entity <http://www.example.com/person/Olivier_Grisel>
+   <s2> sb:entity-type foaf:Person, vCard:vCard, dbpedia-ont:Person
+   <s2> sb:confidence 234,567
+
+If the suggestions are accepted by the client, the RDFa markup could be updated like this:
+
+   <li about="http://www.example.com/person/Rupert_Westenthaler"
+       typeof="foaf:Person" property="foaf:name">Rupert Westenthaler</li>
+   <li about="http://www.example.com/person/Olivier_Grisel"
+       typeof="foaf:Person" property="foaf:name">Olivier Grisel</li>
+
+Now let's take a detailed look at the suggestions to add Rupert and Olivier as a "cal:attendee" to the meeting.
+First we need an EntityAnnotation for the meeting, which would be created by the RDFa extractor:
+
+   <a> rdf:type sb:EntityAnnotation
+   <a> dc:title "Stanbol Teleconference"
+   <a> sb:entity-type cal:Vevent
+   <a> sb:hasOccurrence <o>
+   <a> sb:hasSuggestion <s3>
+   <a> sb:hasSuggestion <s4>
+
+Again let's skip the occurrence and look at the two suggestions. What I want to do here is to suggest using the Annotations for Rupert (<a1>) and Olivier (<a2>) as values for the property "cal:attendee".
+
+It is important to suggest the annotations <a1> and <a2> as values here and NOT the suggested entities (e.g. <http://www.example.com/person/Rupert_Westenthaler> in case of <a1>), because the Stanbol Enhancer cannot assume that the user will accept the suggestions <s1> for <a1> and <s2> for <a2>.
+
+The following suggestions also use the sb:field property to tell the user that the suggestions are about values for the "cal:attendee" property.
+
+   <s3> rdf:type sb:Suggestion
+   <s3> sb:field cal:attendee
+   <s3> sb:entity <a1>
+   <s3> sb:entity-type sb:EntityAnnotation
+   <s3> sb:confidence 12,34
+
+   <s4> rdf:type sb:Suggestion
+   <s4> sb:field cal:attendee
+   <s4> sb:entity <a2>
+   <s4> sb:entity-type sb:EntityAnnotation
+   <s4> sb:confidence 12,34
+
+NOTE:
+
+* I am not sure if it is a good idea to use "sb:entity" to link to an annotation created by the Stanbol Enhancer, because it might confuse users if the same property is used to link external and internal resources. However, introducing an additional property such as "sb:value" does not seem better either.
+
+Here is the RDFa markup if the user accepts <s3> and <s4> but not <s1> and <s2>:
+
+   <span typeof="cal:Vevent">
+       [...]
+       <p> Participants: </p>
+       <ul property="cal:attendee">
+           <li typeof="foaf:Person" property="foaf:name">Rupert Westenthaler</li>
+           <li typeof="foaf:Person" property="foaf:name">Olivier Grisel</li>
+           <li> ... </li>
+       </ul>
+   </span>
+
+And finally the RDFa markup if all suggestions are accepted on the client side:
+
+   <span typeof="cal:Vevent">
+       [...]
+       <p> Participants: </p>
+       <ul property="cal:attendee">
+           <li about="http://www.example.com/person/Rupert_Westenthaler"
+               typeof="foaf:Person" property="foaf:name">Rupert Westenthaler</li>
+           <li about="http://www.example.com/person/Olivier_Grisel"
+               typeof="foaf:Person" property="foaf:name">Olivier Grisel</li>
+       </ul>
+   </span>
 
 
 ### Occurrences
 
-By default detected features are considered to be extracted from the whole content. While this assumption is appropriate for things like categorizations and keywords, in a lot of cases it is possible to specify the exact occurrence of features within the content and/or the metadata of the content.
+By default detected features are considered to be extracted from the whole content. While this assumption is appropriate for things like categorizations and keywords, in a lot of cases it is possible to specify the exact occurrence of features within the content and/or the metadata of the content. In such cases the sb:Annotation will define one or more values for the sb:occurrence property.
 
-Typically Occurrences are used together with sb:Annotations and sb:Enhancement in cases where an EnhancementEngine wants to describe the position of the extracted feature within the analyzed content. So properties defined by these two concepts should be considered when reading this section.
 
 Different Occurrence descriptions are needed to describe the position of a feature within different types of content or within the parsed metadata.