Posted to dev@stanbol.apache.org by Rupert Westenthaler <ru...@gmail.com> on 2012/08/03 11:11:53 UTC

Linked Data Infrastructure (was: ApacheCon EU CFP is open)

Hi all,

I think the best option would be not to use an RDF API at all. The
LDPath project [1] shows very nicely how this could work.

Let me explain:

1. We define an API that allows us to inspect everything we need to
know about the data. This API needs to use generics, so that each RDF
framework can use its own classes and does not need to create wrapper
objects (as is necessary in the case of Clerezza). The RDFBackend [2]
interface of LDPath is such an example.
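To illustrate the idea, here is a minimal sketch of such a generics-based API. The names below are invented for the sketch and are not the actual RDFBackend signatures:

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of a generics-based backend API, loosely modeled on LDPath's
// RDFBackend (method names are illustrative, not the real interface):
// the node type is a type parameter, so each RDF framework can expose
// its own classes without wrapper objects.
interface GenericBackend<Node> {
    boolean isURI(Node node);
    boolean isLiteral(Node node);
    String stringValue(Node node);
    /** all objects o of triples (subject, property, o) */
    Collection<Node> listObjects(Node subject, Node property);
}

// A trivial in-memory implementation whose "native" node type is plain
// String -- just to show that no framework-specific wrappers are needed.
class StringBackend implements GenericBackend<String> {
    private final Map<String, Map<String, List<String>>> spo = new HashMap<>();

    void add(String s, String p, String o) {
        spo.computeIfAbsent(s, k -> new HashMap<>())
           .computeIfAbsent(p, k -> new ArrayList<>()).add(o);
    }
    public boolean isURI(String node)      { return node.startsWith("http://"); }
    public boolean isLiteral(String node)  { return !isURI(node); }
    public String stringValue(String node) { return node; }
    public Collection<String> listObjects(String subject, String property) {
        return spo.getOrDefault(subject, new HashMap<>())
                  .getOrDefault(property, new ArrayList<>());
    }
}
```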

2. One needs to implement this API for every framework one wants to
use with the proposed Linked Data Infrastructure. Existing RDFBackend
implementations include Sesame [3], the Stanbol Entityhub [4] (which,
BTW, is not an RDF framework), Jena [5], a specific Jena TDB
implementation [6], as well as a Clerezza-based implementation [7].

3. Writing functionality based on the generic API is not an easy
task, especially if you need to combine data from different
implementations using different generic types. But typically this API
will only be used by framework developers, not by normal users. The
process() method of [8] is a good example of how to process data from
potentially different RDFBackend implementations.
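To make point 3 concrete, here is a self-contained sketch (NodeBackend is a hypothetical stand-in for an interface like RDFBackend, not the real API): a helper written once against the generic API runs unchanged over two backends with completely different native node types.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.List;

// Minimal stand-in for a generics-based backend interface.
interface NodeBackend<Node> {
    String stringValue(Node node);
    Collection<Node> listObjects(Node subject, Node property);
}

class CrossBackendExample {
    // Framework code like this never sees the concrete node classes;
    // it relies only on the operations the backend interface declares.
    static <Node> List<String> labels(NodeBackend<Node> backend,
                                      Node subject, Node labelProperty) {
        List<String> result = new ArrayList<>();
        for (Node o : backend.listObjects(subject, labelProperty)) {
            result.add(backend.stringValue(o));
        }
        return result;
    }

    // One backend whose native node type is String ...
    static final NodeBackend<String> STRINGS = new NodeBackend<String>() {
        public String stringValue(String n) { return n; }
        public Collection<String> listObjects(String s, String p) {
            return Collections.singletonList(s + "/" + p);
        }
    };

    // ... and another whose native node type is Integer.
    static final NodeBackend<Integer> INTS = new NodeBackend<Integer>() {
        public String stringValue(Integer n) { return Integer.toString(n); }
        public Collection<Integer> listObjects(Integer s, Integer p) {
            return Collections.singletonList(s + p);
        }
    };
}
```

The point of the sketch: labels() is written once, yet it processes data from backends whose generic types have nothing in common.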

The advantage of such a design, however, is that you can ensure
"native processing chains" while having a shared API for writing
framework-specific functionality, e.g. processing a RESTful request
for a Linked Data Resource:

* The request for a Resource is processed using the shared API
* The data are loaded from the Store using the native type
* The JAX-RS Response is created using the generic Entity (the
native instance of the Store)
* The MessageBodyWriter interface of JAX-RS selects the native RDF
serializer for the requested RDF format, based on the parsed Accept
header
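The serializer-selection step in the last bullet could be sketched like this. This is a simplified stand-in, not the real JAX-RS MessageBodyWriter mechanism (which uses isWriteable()/writeTo()); the names below are invented for illustration:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Simplified stand-in for the serializer-selection step: the (already
// parsed) Accept header is matched against registered native
// serializers, so the entity never needs to be converted into another
// RDF framework's classes before serialization.
class SerializerRegistry<Entity> {
    private final Map<String, Function<Entity, String>> writers = new HashMap<>();

    void register(String mediaType, Function<Entity, String> writer) {
        writers.put(mediaType, writer);
    }

    /** serialize with the first accepted media type that has a native writer */
    String serialize(Entity entity, List<String> acceptedTypes) {
        for (String type : acceptedTypes) {
            Function<Entity, String> writer = writers.get(type);
            if (writer != null) {
                return writer.apply(entity);
            }
        }
        throw new IllegalArgumentException("no serializer for " + acceptedTypes);
    }
}
```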

A similar example would also work for SPARQL requests.

A Java API for users who do not want to use the low-level API based
on generics can also be added on top of this, e.g. converting the
Sesame data returned by a request into a Clerezza in-memory Graph
containing the response data and returning a GraphNode instance for
the requested Entity.

Or, to give another example: some years ago I implemented a framework
using Sesame Elmo [9] (a Java object-to-RDF mapping tool) that allowed
users to work with a Java domain model built from OWL ontologies. The
interesting thing was that it did not directly access the TripleStore,
but instead dynamically fetched/stored RDF graphs for the RDF
resources referenced/modified by calls to the Java API. Something like
this could also be built on top of the framework proposed here.

Direct access to the data stored in the backend would not be possible
without using the low-level API, but I think this is not
necessary/intended for a Linked Data Infrastructure anyway, as access
to Linked Data is usually in the context of a Resource and not at the
granularity of triples (ok, maybe with the exception of SPARQL
UPDATE ...)

WDYT
Rupert


[1] http://code.google.com/p/ldpath/
[2] http://code.google.com/p/ldpath/source/browse/ldpath-api/src/main/java/at/newmedialab/ldpath/api/backend/RDFBackend.java
[3] http://code.google.com/p/ldpath/source/browse/ldpath-backend-sesame/src/main/java/at/newmedialab/ldpath/backend/sesame/GenericSesameBackend.java
[4] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/ldpath/src/main/java/org/apache/stanbol/entityhub/ldpath/backend/
[5] http://code.google.com/p/ldpath/source/browse/ldpath-backend-jena/src/main/java/at/newmedialab/ldpath/backend/jena/GenericJenaBackend.java
[6] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/source/jenatdb/src/main/java/org/apache/stanbol/entityhub/indexing/source/jenatdb/RdfIndexingSource.java
[7] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/commons/ldpath/clerezza/src/main/java/org/apache/stanbol/commons/ldpath/clerezza/ClerezzaBackend.java
[8] http://svn.apache.org/repos/asf/incubator/stanbol/trunk/entityhub/indexing/core/src/main/java/org/apache/stanbol/entityhub/indexing/core/processor/LdpathSourceProcessor.java
[9] http://www.openrdf.org/doc/elmo/1.5/user-guide.html


On Fri, Aug 3, 2012 at 10:08 AM, Sebastian Schaffert
<se...@salzburgresearch.at> wrote:
> Hi Tommaso,
>
> I have worked a bit with Clerezza in the context of Stanbol. While I can see the advantages of being triple-store agnostic and having a large OSGi infrastructure already in place, currently we do not intend to base the implementation on Clerezza. There are a number of reasons for this:
> - we implement a reasoning and a versioning functionality that require low-level access to the triple store to be efficient; this would mean that we rather implement a triple store implementation beneath Sesame or Jena and use these APIs than the other way round; since both are considered "killer-features" by many users, I would not like to remove this functionality
> - the Linked Data Server should be capable of handling hundreds of millions or even billions of triples (our current biggest productive scenario has around 140 million triples) and efficiently run queries (SPARQL) over this dataset; an intermediate layer in the way implemented by Clerezza is not suitable for this (running efficient SPARQL queries requires translating them into queries on the storage layer - e.g. SQL - and cannot be done on top of the API level)
> - Clerezza essentially provides YET ANOTHER API for managing triples; consequence: developers need to learn its specifics, the new API needs to be fire-proven again even though the other two (Sesame and Jena) are already established, the existing standards (SPARQL query+update, various RDF serialization formats, …) need to be implemented yet-again
>
> At the moment, our implementation builds mostly on the Sesame API because it is in our opinion by far the cleanest existing API for working with RDF (in fact, I consider it very well thought out and consistent, I have worked with both Sesame and Jena for many years). This is of course not a hard constraint, but for the time being I would keep it like this to avoid unnecessary additional effort. In the future, however, it might make sense to use Clerezza as a "middle layer" in between our low-level and our high-level functionalities (depending on how Clerezza itself develops).
>
> If you want to, we can discuss this also at ApacheCon, perhaps I am missing something. ;-)
>
> Greetings,
>
> Sebastian
>
>
> Am 02.08.2012 um 11:11 schrieb Tommaso Teofili:
>
>> Hi Sebastian,
>>
>> that looks very good. Reading your idea of a proposal (which I like), I am
>> wondering what you think about using / extending Apache Clerezza as the
>> basic infrastructure for that.
>> Reasons for that would be: it's triple store agnostic (and extensible), can
>> "offer" REST interfaces for the underlying data, Stanbol is based on it.
>> Surely it should be improved for scalability purposes (i.e. building a
>> cluster of Clerezza instances).
>>
>> Thanks for sharing your nice proposal.
>> Looking forward to meeting you at ApacheCon.
>> Cheers,
>> Tommaso
>>
>> 2012/8/2 Sebastian Schaffert <se...@salzburgresearch.at>
>>
>>> Dear all,
>>>
>>> Rupert and I would also like to propose a presentation about a "Apache
>>> Linked Data Server", which could continue the work we have so far been
>>> doing in the integration of our Linked Media Framework and Apache Stanbol.
>>> The idea here would be that we submit significant parts of the LMF as an
>>> Apache incubator proposal that would closely integrate with Stanbol, which
>>> could eventually result in the mentioned "Apache Linked Data Server".
>>>
>>> Why would this be useful? More and more institutions participate in "open
>>> data" initiatives, but open data currently mostly means simply publishing a
>>> CSV or XLS file on a Web server. Semantic Web technology and Linked Data
>>> would offer much better interoperability, but if you really plan to publish
>>> data, the amount of work required is significant (e.g. setting up a
>>> Virtuoso server) and does not integrate so well with other technologies
>>> that are in use in institutions. An Open Source Apache project that could
>>> be deployed easily would make it much easier, especially for public
>>> institutions or small enterprises, to offer their data publicly.
>>>
>>> I have some screencasts already on how this works with the LMF:
>>>
>>> http://code.google.com/p/lmf/
>>>
>>> And we would like to discuss in what form it would make sense to transform
>>> this into an Apache project. The presentation at ApacheCon could be a first
>>> presentation of this idea.
>>>
>>> Greetings,
>>>
>>> Sebastian
>>>
>>> Am 02.08.2012 um 09:11 schrieb Fabian Christ:
>>>
>>>> Hi,
>>>>
>>>> I do not know how many presentations will potentially be accepted. I
>>>> would assume that there is not that much room for more than one
>>>> presentation from each podling but that is just a guess. We will see.
>>>>
>>>> Okay then, I will submit a proposal for an overview talk about Stanbol
>>>> which is led by use cases and scenarios where Stanbol may be a
>>>> helpful framework for people, as Tommaso suggested.
>>>>
>>>> Best,
>>>> - Fabian
>>>>
>>>> 2012/7/31 Suat Gonul <su...@gmail.com>:
>>>>> Hi all,
>>>>>
>>>>> I can submit a presentation about the Contenthub. It's mainly about
>>>>> Contenthub, but it also includes Enhancer, Entityhub and CMS Adapter
>>>>> components. The use case may be similar to our previous ehealth
>>>>> demonstration [1]. In terms of its content, it may well suit after
>>>>> Rupert's presentation. So, the use case may be as follows:
>>>>>
>>>>>   * A content administrator would configure a few healthcare datasets
>>>>>     as Referenced/Managed Sites based on the indexing facilities of
>>>>>     Entityhub.
>>>>>   * He would configure KeywordLinkingEngines associated with those
>>>>>     Sites to be able to do health domain specific enhancements.
>>>>>   * After analyzing the details of the external datasets, he defines
>>>>>     an LDPath which is compatible with the external datasets. This
>>>>>     LDPath is used as the configuration of a Solr based SemanticIndex.
>>>>>   * In the next step, the admin configures a CMS Adapter based Store
>>>>>     associated with his workspace in the CMS.
>>>>>   * When he creates/updates documents on the CMS, the Store keeps track
>>>>>     of the changes to the CMS documents and enhances them automatically.
>>>>>   * The LDPath based SemanticIndex becomes aware of the changes in the
>>>>>     Store and indexes the documents according to its LDPath
>>>>>     configuration. In this process, it also gathers additional
>>>>>     information from the external datasets for the named entities
>>>>>     recognized in the documents from the ManagedSites associated with
>>>>>     the external datasets.
>>>>>   * As a result, the admin would have a semantically enhanced Solr
>>>>>     index for the health domain, and he can use the index directly
>>>>>     through its RESTful API.
>>>>>
>>>>> I hope the use case is clear. What do you think?
>>>>>
>>>>> Best,
>>>>> Suat
>>>>>
>>>>> [1] http://www.youtube.com/watch?v=l7n6aRFcn1U
>>>>>
>>>>>
>>>>> On 07/31/2012 11:40 AM, Rupert Westenthaler wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I think I will then go for a presentation about how to use the Stanbol
>>>>>> Enhancer to link Linked Data Entities.
>>>>>>
>>>>>> * Intro
>>>>>> * NER vs. KeywordExtraction (typical Chain configurations, used
>>>>>> Enhancement Engines)
>>>>>> * Indexing Datasets for the Entityhub
>>>>>> * Managing Entities via the RESTful Interface (Entityhub Managed Sites)
>>>>>>
>>>>>> This assumes that there is also a more general overview about Stanbol
>>>>>> and the Stanbol Enhancer - hoping @Fabian for that.
>>>>>>
>>>>>> WDYT
>>>>>> best
>>>>>> Rupert Westenthaler
>>>>>>
>>>>>> On Thu, Jul 26, 2012 at 3:47 PM, Tommaso Teofili
>>>>>> <to...@gmail.com> wrote:
>>>>>>> In my opinion starting from a real use case which demonstrates a
>>> subset of
>>>>>>> the whole set of features is a good way of catching audience's
>>> attention.
>>>>>>> My 2 cents,
>>>>>>> Tommaso
>>>>>>>
>>>>>>> 2012/7/26 Rupert Westenthaler <ru...@gmail.com>
>>>>>>>
>>>>>>>> On Thu, Jul 26, 2012 at 11:04 AM, Fabian Christ
>>>>>>>> <ch...@googlemail.com> wrote:
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I just want to point out that the CFP deadline is 3rd August, 2012.
>>>>>>>>>
>>>>>>>> Yes I am interested ...
>>>>>>>>
>>>>>>>> Should we aim for a Stanbol overview talk or rather concentrate on -
>>>>>>>> maybe several - specific features/components?
>>>>>>>>
>>>>>>>> best
>>>>>>>> Rupert
>>>>>>>>
>>>>>>>>> Is there any interest from other committers to submit a talk?
>>>>>>>>>
>>>>>>>>> Best,
>>>>>>>>> - Fabian
>>>>>>>>>
>>>>>>>>> 2012/7/17 Fabian Christ <ch...@googlemail.com>:
>>>>>>>>>> Hi,
>>>>>>>>>>
>>>>>>>>>> the call for papers and talks for the upcoming ApacheCon EU is open
>>>>>>>>>>
>>>>>>>>>> http://www.apachecon.eu/cfp/
>>>>>>>>>>
>>>>>>>>>> I would be interested in submitting a paper/talk about Stanbol.
>>>>>>>>>> Perhaps we should start to collect ideas here. Anyone else here
>>>>>>>>>> interested in writing something?
>>>>>>>>>>
>>>>>>>>>> Best,
>>>>>>>>>> - Fabian
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> Fabian
>>>>>>>>>> http://twitter.com/fctwitt
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> Fabian
>>>>>>>>> http://twitter.com/fctwitt
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> | Rupert Westenthaler             rupert.westenthaler@gmail.com
>>>>>>>> | Bodenlehenstraße 11                             ++43-699-11108907
>>>>>>>> | A-5500 Bischofshofen
>>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Fabian
>>>> http://twitter.com/fctwitt
>>>
>>> Sebastian
>>> --
>>> | Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
>>> | Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
>>> | Head of Knowledge and Media Technologies Group          +43 662 2288 423
>>> | Jakob-Haringer Strasse 5/II
>>> | A-5020 Salzburg
>>>
>>>
>
> Sebastian
> --
> | Dr. Sebastian Schaffert          sebastian.schaffert@salzburgresearch.at
> | Salzburg Research Forschungsgesellschaft  http://www.salzburgresearch.at
> | Head of Knowledge and Media Technologies Group          +43 662 2288 423
> | Jakob-Haringer Strasse 5/II
> | A-5020 Salzburg
>



-- 
| Rupert Westenthaler             rupert.westenthaler@gmail.com
| Bodenlehenstraße 11                             ++43-699-11108907
| A-5500 Bischofshofen