Posted to users@jena.apache.org by "Arndt, Dave (ELS-DAY)" <da...@Elsevier.com> on 2011/07/19 17:39:36 UTC

Jena on key-value store?

I am just getting into Jena. I see that there are adapters for running Jena on an RDBMS, but I was also wondering if there are adapters to run it on a key-value store such as Oracle Coherence or memcached.

Regards,
Dave Arndt - Elsevier Enterprise Architect

Re: Jena on key-value store?

Posted by Andy Seaborne <an...@epimorphics.com>.

On 19/07/11 16:39, Arndt, Dave (ELS-DAY) wrote:
> I am just getting into Jena.    I see that there are adapters for
> running Jena on an RDBMS, but I was also wondering if there are
> adapters to run a key-value store such as oracle coherence or
> memcached.
>
> Regards, Dave Arndt - Elsevier Enterprise Architect
>
>

Dave,

Jena itself does not currently provide an adapter for a key-value store 
layer.  The storage layers the core project provides are:

in-memory
TDB - custom storage
SDB - adapter to SQL

but it's an open architecture.  There have been some Lucene/Solr-based
stores as well, e.g. http://www.scottstreit.com/solrstore.html

The core of the API is

add(triple)
delete(triple)
find(?,?,?) for ? being a fixed value or a wildcard.
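
A minimal sketch of that three-operation core, in Python for brevity. The
class name TinyGraph and the use of None as the wildcard are assumptions
made for this illustration; this is not Jena's actual Graph interface.

```python
class TinyGraph:
    """Toy triple store: add, delete, and wildcard find (hypothetical, not Jena's API)."""

    def __init__(self):
        self._triples = set()

    def add(self, s, p, o):
        self._triples.add((s, p, o))

    def delete(self, s, p, o):
        self._triples.discard((s, p, o))

    def find(self, s=None, p=None, o=None):
        """Return matching triples, sorted; None in a slot acts as a wildcard."""
        pattern = (s, p, o)
        return sorted(
            t for t in self._triples
            if all(want is None or want == got for want, got in zip(pattern, t))
        )


g = TinyGraph()
g.add("ex:alice", "ex:worksFor", "ex:acme")
g.add("ex:alice", "ex:name", "Alice")
g.add("ex:bob", "ex:worksFor", "ex:initech")

by_subject = g.find(s="ex:alice")         # find(S,?,?)
by_predicate = g.find(p="ex:worksFor")    # find(?,P,?)
```

A linear scan like this is only meant to show the shape of the API; a
real store answers find() from indexes rather than by filtering everything.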

SPARQL builds on find() or goes directly to the storage layer, but even
here, at the core of the query execution process, the access unit is a
range scan based on find(?,?,?) where, in the SDB and TDB cases, the ? may
be a wildcard or an id identifying the node. This includes find(S,P,?),
for example.

SPARQL employs joins to build graph patterns.
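
To make the join step concrete: a basic graph pattern can be evaluated by
calling find() once per triple pattern and merging compatible variable
bindings. The sketch below is a naive nested-loop join in Python. The
convention that strings starting with "?" are variables, and the function
names, are assumptions for this example; a real query engine is more
sophisticated.

```python
def find(triples, s, p, o):
    """Match one pattern against a list of triples; None is a wildcard."""
    return [t for t in triples
            if all(want is None or want == got for want, got in zip((s, p, o), t))]


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def solve(triples, patterns, binding=None):
    """Yield variable bindings satisfying every pattern (naive nested-loop join)."""
    binding = binding or {}
    if not patterns:
        yield binding
        return
    first, rest = patterns[0], patterns[1:]
    # Substitute variables already bound; unbound variables become wildcards.
    concrete = [binding.get(t) if is_var(t) else t for t in first]
    for triple in find(triples, *concrete):
        new_binding = dict(binding)
        consistent = True
        for term, value in zip(first, triple):
            # A variable repeated inside one pattern must match the same value.
            if is_var(term) and new_binding.setdefault(term, value) != value:
                consistent = False
                break
        if consistent:
            yield from solve(triples, rest, new_binding)


data = [
    ("ex:alice", "ex:knows", "ex:bob"),
    ("ex:bob", "ex:knows", "ex:carol"),
    ("ex:bob", "ex:name", "Bob"),
]
# Whom does alice know, and what is that person's name?
results = list(solve(data, [("ex:alice", "ex:knows", "?x"),
                            ("?x", "ex:name", "?n")]))
```

Note how the second pattern is issued as find(S,P,?) once ?x is bound,
which is the access pattern the storage layer has to serve efficiently.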

A key-value store provides an access operation of "get(K)->V", so to
utilise a KV store the core API needs to be layered on that.  A plain
get(K) API does not provide the right style of access at scale.

You could layer on a KV store with an intermediate layer mapping access
styles - I've done a version of TDB that used Project Voldemort as its
storage layer [1].  TDB uses B+Trees as the range indexes; the KV store
is used to store the B+Tree blocks because at that level the lookup is
block number -> bytes of block.
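
The "block number -> bytes of block" idea can be sketched as follows. This
is not how TDB-V is implemented: a flat, one-level directory of first keys
stands in for the B+Tree, JSON stands in for the block encoding, and a
Python dict stands in for the remote KV store. All names are hypothetical.

```python
import bisect
import json

BLOCK_SIZE = 2  # entries per block; tiny so the example spans several blocks


def build_index(kv, triples):
    """Write sorted triples into kv as blocks; return the first key of each block."""
    ordered = sorted(triples)
    first_keys = []
    for block_no, start in enumerate(range(0, len(ordered), BLOCK_SIZE)):
        block = ordered[start:start + BLOCK_SIZE]
        kv[block_no] = json.dumps(block).encode("utf-8")  # block number -> bytes
        first_keys.append(block[0])
    return first_keys


def range_scan(kv, first_keys, prefix):
    """Yield triples whose leading components equal prefix, in sorted order."""
    prefix = tuple(prefix)
    # Matches may begin in the block just before the first block whose
    # first key is >= prefix, so step back one block (but not below 0).
    start = max(bisect.bisect_left(first_keys, prefix) - 1, 0)
    for block_no in range(start, len(first_keys)):
        for row in json.loads(kv[block_no].decode("utf-8")):
            head = tuple(row[:len(prefix)])
            if head == prefix:
                yield tuple(row)
            elif head > prefix:
                return  # past the range; no later block can match


kv = {}  # stand-in for a KV store client
first_keys = build_index(kv, [
    ("s1", "p1", "o1"), ("s1", "p2", "o2"),
    ("s2", "p1", "o3"), ("s2", "p1", "o4"),
    ("s3", "p1", "o5"),
])
sp_hits = list(range_scan(kv, first_keys, ("s2", "p1")))  # find(S,P,?)
```

The range scan touches the KV store only through kv[block_no], so each
get() returns a whole block of sorted entries rather than a single triple.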

The restricted range scan requirement is that find(S,*,*) gives all PO
for that S and find(S,P,*) gives all the O; and so on for the other
combinations.

The Jena in-memory design is close to the KV paradigm.  It uses a hash
table to map a node to all the other components, for example S->PO.
This is KV-style, so a store could be built on a KV platform.
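
A minimal sketch of the S->PO layout on a KV store, assuming one entry per
subject whose value is the JSON-encoded list of its (P, O) pairs. The class
name and the encoding are illustrative assumptions, and a dict stands in
for the KV client.

```python
import json


class SubjectKeyedStore:
    """Triples grouped by subject: one KV entry per subject holding its (P, O) pairs."""

    def __init__(self, kv):
        self.kv = kv  # any mapping with get() and item assignment

    def _load(self, s):
        raw = self.kv.get(s)
        return json.loads(raw) if raw is not None else []

    def add(self, s, p, o):
        pairs = self._load(s)
        if [p, o] not in pairs:
            pairs.append([p, o])
            self.kv[s] = json.dumps(pairs)  # read-modify-write of the whole value

    def find(self, s, p=None, o=None):
        """find(S,?,?), find(S,P,?) and find(S,P,O) each cost a single get(S)."""
        return [(s, p2, o2) for p2, o2 in self._load(s)
                if (p is None or p == p2) and (o is None or o == o2)]


store = SubjectKeyedStore({})
store.add("ex:alice", "ex:knows", "ex:bob")
store.add("ex:alice", "ex:name", "Alice")
store.add("ex:bob", "ex:name", "Bob")

alice_all = store.find("ex:alice")                # find(S,?,?)
alice_name = store.find("ex:alice", "ex:name")    # find(S,P,?)
```

This layout only answers patterns with a fixed subject; lookups that start
from P or O would need additional key layouts of their own.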

But the granularity of access is very small.  A subject typically has
only a few property-values (PO); maybe 10 is quite a lot.  If the KV
store is not in-process, access costs are in danger of killing the
performance.  Essentially, this is how we ended up with TDB - SDB uses
JDBC and the interaction overheads, even when shipping larger units of
SPARQL basic graph patterns, can dwarf the efficiency of the SQL engine
indexing.

You also have to decide what to do about P->SO and P->OS access.  Unless
it's DBpedia, there are far fewer properties than subjects or objects,
so if simply mapped to a KV store, a P-only lookup (which isn't common,
at least in SPARQL) involves only a few keys, each leading to a huge
number of entries.

So there are interesting possibilities but it's not a simple matter of 
an adapter layer.

What is also interesting is looking at ways to access RDF that aren't at
the API or SPARQL query levels.  Looking up a URI to get a whole graph
back - see the SPARQL HTTP Graph Store protocol [2].  That is a good
match to KV stores.
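
A rough sketch of why that is a good match: with the graph URI as the key
and a serialized graph as the value, the protocol's GET, PUT, POST and
DELETE operations map onto KV operations almost one-to-one. The class below
is hypothetical, keeps values as N-Triples text, and uses a dict in place
of a real KV store; its POST merges line by line, which is a simplification
of a real RDF merge (it ignores blank node handling, for instance).

```python
class GraphKV:
    """Whole-graph access keyed by graph URI (illustrative only)."""

    def __init__(self):
        self.kv = {}  # stand-in for a KV store client

    def put(self, graph_uri, ntriples):
        """Like HTTP PUT: replace the graph stored under this URI."""
        self.kv[graph_uri] = ntriples

    def get(self, graph_uri):
        """Like HTTP GET: return the whole graph, or None if absent."""
        return self.kv.get(graph_uri)

    def post(self, graph_uri, ntriples):
        """Like HTTP POST: merge new statements into the stored graph."""
        lines = self.kv.get(graph_uri, "").splitlines()
        for line in ntriples.splitlines():
            if line and line not in lines:
                lines.append(line)
        self.kv[graph_uri] = "\n".join(lines)

    def delete(self, graph_uri):
        """Like HTTP DELETE: remove the graph."""
        self.kv.pop(graph_uri, None)


graphs = GraphKV()
uri = "http://example.org/graph/alice"
graphs.put(uri, '<http://example.org/alice> <http://example.org/name> "Alice" .')
graphs.post(uri, '<http://example.org/alice> <http://example.org/knows> <http://example.org/bob> .')
whole_graph = graphs.get(uri)  # one key lookup returns the entire graph
```

Each operation is a single key access, so the round-trip cost is paid once
per graph rather than once per triple.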

	Andy

[1] https://github.com/afs/TDB-V
[2] http://www.w3.org/TR/sparql11-http-rdf-update/