Posted to dev@stanbol.apache.org by David Riccitelli <da...@insideout.io> on 2013/02/21 12:16:23 UTC

TextAnnotations New Model

Hello Stanbolers,

I created STANBOL-953 related to the Text Annotations New Model engine.

Here's some background: Stanbol defines new specifications for the Text
Annotations definitions as part of the result of an enhancement analysis.
These specifications are published on the official web site [
http://stanbol.apache.org/docs/trunk/components/enhancer/enhancementstructure.html#fisetextannotation
].

Their aim is to add the head/tail and prefix/suffix information to a Text
Annotation. This would greatly benefit dependent services that need to
clean up the textual content before sending it for analysis, yet still
need meaningful information for linking the identified entities with
the related labels in the text (without relying on the unreliable start/end
information).

In order to jump-start support for the head/tail and prefix/suffix model,
we created a TextAnnotations-NewModel engine [
https://github.com/insideout10/wordlift-stanbol/tree/master/textannotations-futuremodel]
which converts start/end information to head/tail/prefix/suffix
information before the analysis results are returned to the client.
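
The conversion described above can be illustrated with a minimal Python
sketch. It is not the code of the engine itself: the function name, the
dictionary keys and the window sizes are assumptions made for illustration
only.

```python
def to_new_model(text, start, end, context_len=10, head_tail_len=10):
    """Derive prefix/head/tail/suffix strings from start/end offsets.

    context_len and head_tail_len are arbitrary illustrative values,
    not defaults taken from Stanbol.
    """
    selected = text[start:end]
    return {
        # a few characters before the selection
        "selection-prefix": text[max(0, start - context_len):start],
        # the first and last characters of the selection itself
        "selection-head": selected[:head_tail_len],
        "selection-tail": selected[max(0, len(selected) - head_tail_len):],
        # a few characters after the selection
        "selection-suffix": text[end:end + context_len],
    }


text = "Apache Stanbol was presented in Salzburg last year."
start = text.index("Salzburg")
annotation = to_new_model(text, start, start + len("Salzburg"),
                          context_len=5, head_tail_len=4)
```

A client that has altered the text (for example by stripping markup) can
then search for these strings instead of trusting the offsets.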

This engine was previously announced to the dev mailing list [
http://mail-archives.apache.org/mod_mbox/stanbol-dev/201211.mbox/%3CCAG94HGi2MiSWgtvYU7-bNqgQVmGRc0w7vL1CZEzV-Fc4XNSjrg@mail.gmail.com%3E
].

BR,
David

Re: TextAnnotations New Model

Posted by Rupert Westenthaler <ru...@gmail.com>.

Hi David, all

I am very positive about the "new" model as included in this mail:

 * fise:selection-prefix: some words/characters before the selected section.
 * fise:selection-head: the first few words/characters of the
selected section within the text. Alternative to fise:selected-text in
case bigger sections of the parsed content need to be selected.
 * fise:selection-tail: the last few words/characters of a selected
section. To be used together with fise:selection-head.
 * fise:selection-suffix: some words/characters after the selected section.

Having selection-prefix and selection-suffix will make it much easier
to determine the exact position of a fise:TextAnnotation if the char
indexes cannot be used (e.g. within HTML documents). The current
model used by the fise:TextAnnotation makes this hard, especially if
the fise:selected-text occurs several times within the
fise:selection-context.
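
As a rough illustration of why prefix and suffix help, the following Python
sketch picks the right occurrence of a repeated label. The function name and
the sample text are hypothetical, and exact string matching is assumed, which
only works if the client text matches the analysed text around the selection.

```python
def locate(content, prefix, selected, suffix):
    """Return the offset of `selected` where it directly follows `prefix`
    and is directly followed by `suffix`, or -1 if there is no such place."""
    pos = content.find(selected)
    while pos != -1:
        before = content[:pos]
        after = content[pos + len(selected):]
        if before.endswith(prefix) and after.startswith(suffix):
            return pos
        # try the next occurrence of the selected text
        pos = content.find(selected, pos + 1)
    return -1


content = "Paris Hilton visited Paris in May."
offset = locate(content, "visited ", "Paris", " in")
```

Here the first "Paris" is skipped because it is not preceded by the prefix,
so the second occurrence is returned.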

The introduction of fise:selection-head and fise:selection-tail will
allow selecting bigger parts of documents (e.g. sentences or whole
sections). For such selections fise:selected-text is not feasible, as
it would require duplicating the whole section as an RDF literal within
the metadata of the ContentItem.
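
A minimal Python sketch of how a client might resolve such a head/tail pair
back to a span, without the whole section being stored in the metadata. The
function name is hypothetical, and the sketch simply takes the first head
match and the nearest following tail match; a real client would likely also
check the prefix and suffix.

```python
def locate_span(content, head, tail):
    """Return (start, end) of the first span that begins with `head`
    and ends with `tail`, or None if no such span exists."""
    start = content.find(head)
    if start == -1:
        return None
    tail_pos = content.find(tail, start)
    # skip tail matches that would end before the head does
    while tail_pos != -1 and tail_pos + len(tail) < start + len(head):
        tail_pos = content.find(tail, tail_pos + 1)
    if tail_pos == -1:
        return None
    return start, tail_pos + len(tail)


content = "Intro. The first sentence is long and rambling. Outro."
span = locate_span(content, "The first", "rambling.")
```

Head and tail may overlap for short selections, which is why the tail search
starts at the head position rather than after it.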

With these changes the fise:selection-context would be used solely for
defining the part of the text used to extract a given Enhancement
(e.g. the sentence analyzed by the NER module to detect a
Named Entity). Its usage to determine the location of the selected
text would be deprecated.

The contributed EnhancementEngine provides a very strong migration
path for the new model. It would even allow users of the 0.10.0
version of the Stanbol Enhancer to migrate to the new
fise:TextAnnotation model. For future releases, EnhancementEngines
should be adapted to directly support the new model.

Given all that I would suggest that we introduce this new model.

WDYT
Rupert

--
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen