Posted to dev@stanbol.apache.org by Rafa Haro <rh...@apache.org> on 2015/03/04 16:56:34 UTC

Model for Linking Data

Hi all, 

Recently, while working on a post-processing engine, I have realized that it is currently not straightforward to deal with the data produced by the Linking engines. In my opinion, the core problem is that there is currently no easy way to relate the results of the NLP analysis with the results of the Linking process. After NLP analysis, all the extracted Spans (tokens, sentences, chunks and so on) are stored in an AnalyzedText object [1]. This model has a pleasant API and really eases the work of the subsequent engines within a chain. However, the results of the Linking engines are currently stored only in the Clerezza graph holding the metadata of a ContentItem, mainly as Text and Entity Annotations. Although there are some helpers for dealing with the annotations within the graph, when developing a, let’s say, post-linking engine, a developer really misses a way to find, for example, the text and entity annotations that could be associated with the spans. The only way I have found, short of starting to work on a proper solution for this, has been to locate the spans associated with a Text Annotation by using the start and end offsets.
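For illustration, the offset-based workaround described above might be sketched as follows. Note that Span and TextAnnotation here are deliberately simplified stand-in types, not the actual Stanbol AnalysedText span classes or fise:TextAnnotation resources; the point is only the matching logic on fise:start/fise:end offsets:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the offset-based matching workaround: given a text
// annotation's selection offsets, find the NLP spans it covers.
// Span and TextAnnotation are simplified illustrative stand-ins.
public class OffsetMatcher {

    static final class Span {
        final int start, end;
        Span(int start, int end) { this.start = start; this.end = end; }
    }

    static final class TextAnnotation {
        final int start, end; // would come from fise:start / fise:end
        TextAnnotation(int start, int end) { this.start = start; this.end = end; }
    }

    /** Return all spans whose offsets fall inside the annotation's selection. */
    static List<Span> findSpans(TextAnnotation ta, List<Span> spans) {
        List<Span> matches = new ArrayList<>();
        for (Span s : spans) {
            if (s.start >= ta.start && s.end <= ta.end) {
                matches.add(s);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Three token spans; the annotation selects exactly the second one.
        List<Span> tokens = List.of(new Span(0, 5), new Span(6, 12), new Span(13, 20));
        TextAnnotation ta = new TextAnnotation(6, 12);
        System.out.println(findSpans(ta, tokens).size()); // prints 1
    }
}
```

This is essentially a linear scan per annotation; a real solution would presumably want an index or a direct model-level link instead, which is what the discussion below is about.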

I would like to start a discussion here about the best design for tackling this problem.

Cheers,
Rafa

[1] - https://stanbol.apache.org/docs/trunk/components/enhancer/nlp/analyzedtext



Re: Model for Linking Data

Posted by Magnus Knuth <ma...@hpi.uni-potsdam.de>.
Hi,

there is an RDF vocabulary for exactly this purpose: the NLP Interchange Format (NIF) [1].
Maybe that helps.

[1] http://persistence.uni-leipzig.org/nlp2rdf/
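For illustration, a minimal NIF fragment might look like the following (the document and entity URIs are made-up examples). Both NLP spans and linking results can then refer to the same offset-addressed resource:

```turtle
@prefix nif:    <http://persistence.uni-leipzig.org/nlp2rdf/ontologies/nif-core#> .
@prefix itsrdf: <http://www.w3.org/2005/11/its/rdf#> .
@prefix xsd:    <http://www.w3.org/2001/XMLSchema#> .

# The whole document text is a nif:Context; every span references it.
<http://example.org/doc#char=0,120>
    a nif:Context .

# A span identified purely by its character offsets.
<http://example.org/doc#char=10,16>
    a nif:String ;
    nif:referenceContext <http://example.org/doc#char=0,120> ;
    nif:beginIndex "10"^^xsd:nonNegativeInteger ;
    nif:endIndex   "16"^^xsd:nonNegativeInteger ;
    nif:anchorOf   "Berlin" ;
    itsrdf:taIdentRef <http://dbpedia.org/resource/Berlin> .
```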

On 04.03.2015 at 16:56, Rafa Haro <rh...@apache.org> wrote:

> [...]

-- 
Magnus Knuth

Hasso-Plattner-Institut für Softwaresystemtechnik GmbH
Prof.-Dr.-Helmert-Str. 2-3
14482 Potsdam

Amtsgericht Potsdam, HRB 12184
Managing Director: Prof. Dr. Christoph Meinel

tel:     +49 331 5509 547
email:   magnus.knuth@hpi.de
web:     http://www.hpi.de/
webID:   http://magnus.13mm.de/