Posted to users@jena.apache.org by Tobias Neef <to...@gmail.com> on 2012/04/24 15:27:02 UTC

How to implement a custom JENA Backend

Hi,

I am currently thinking about developing a custom "Backend" for Jena
just like the relational one and the native TDB. It would be great if
you could give me some general hints on how the process of developing
such a backend would look.

Best regards,
Tobias Neef

Re: How to implement a custom JENA Backend

Posted by Andy Seaborne <an...@apache.org>.
<> rdfs:seeAlso <https://diuf.unifr.ch/main/xi/diplodocus> .

<https://diuf.unifr.ch/main/xi/diplodocus>
    rdfs:label "dipLODocus[RDF]" .


On 24/04/12 19:56, Andy Seaborne wrote:
> On 24/04/12 14:57, Paolo Castagna wrote:
>> Tobias Neef wrote:
>>> Hi Paolo,
>>>
>>> thanks for the quick response! The reason for doing this is that I
>>> think it would be useful to have an RDF database with a SPARQL interface
>>> which can be used as a PaaS offering like Amazon RDS or Amazon Dynamo
>>> DB: for the developer this would mean no hassle about replication or
>>> scaling, etc. To some extent you can achieve that when using Jena SDB
>>> on top of something like Amazon RDS or MS SQL Azure. I want to try how
>>> far I can get when I use Jena as API and map it to something like
>>> Dynamo DB or MS Azure Tables which have quite unique
>>> Scalability/Availability characteristics. There is for example
>>> http://datomic.com/ which also goes along those lines. They
>>> implemented it on top of Dynamo DB but with a custom query language.
>>>
>>> Does that make sense from your perspective?
>
> Hi Tobias,
>
> Interesting space and it would be great to have such a service.
>
> There are quite a few design choices to make and they can greatly
> influence the design. For example: a service that offered replication
> etc and had many datasets can be built using one dataset per machine as
> the unit. It scales in total data but not in data-per-dataset or graph.
>
> A service that specialised in massive data (more about data management
> than raw query performance; maybe like a column store if aggregation
> queries matter) is different from one giving as-near-real-time response
> for UIs (basically, in-memory or the working set is in-memory).
>
> In terms of where to start,
>
> SDB if you are building on top of an SQL service
>
> TDB, or the shell of TDB, if you are building on what amounts to an index
> service. TDB is built on top of indexes - you can plug in your own.
>
> I have built a TDB that used Project Voldemort as a block store for the
> TDB B+Trees. It worked quite well but as a highly scalable base, it's
> limited as too much work ends up on the query engine and not enough of
> the index access work is done in the cluster.
>
> As for examples: see http://www.dydra.com/ which is SPARQL.
>
> Andy
>


Re: How to implement a custom JENA Backend

Posted by Paolo Castagna <ca...@googlemail.com>.
Tobias Neef wrote:
> Hi Andy,
...
> Is there a chance that I can take a look at your project code?

Here: https://github.com/afs/TDB-V (interesting project!) ;-)

A different approach, a la SDB over HBase is here:
https://github.com/castagna/hbase-rdf
... however, if you want to implement SPARQL you are better off with a storage
system which provides you with JOINs (HBase doesn't).

And an approach like TDB-V is certainly better.

Indeed, even more help (specific to RDF) from the distributed store, as Andy
said, is needed and welcome... but at that point you need to implement your own
distributed store (which isn't a simple matter of programming ;-)).

Paolo


Re: How to implement a custom JENA Backend

Posted by Paolo Castagna <ca...@googlemail.com>.
Andy Seaborne wrote:
> (hmm - I wonder if there is any Sindice-related work that analyses a
> SPARQL query to see if it can be mapped to a Sindice index because
> Sindice can do a subset of graph patterns.)

Sindice --> SIREn, here: http://siren.sindice.com/
Code is here: https://github.com/rdelbru/SIREn/

Paolo


Re: How to implement a custom JENA Backend

Posted by Andy Seaborne <an...@apache.org>.
On 25/04/12 09:33, Tobias Neef wrote:
> Hi Andy,
>
> thanks a lot for your insight. I agree with you that there is no free
> cake. Most of the highly scalable stores have a very specific usage
> scenario which is their sweet spot. Also the qualities of such a
> service would depend on the mapping strategy you choose.
>
>> I have built a TDB that used Project Voldemort as a block store for the TDB
>> B+Trees.  It worked quite well but as a highly scalable base, it's limited as
>> too much work ends up on the query engine and not enough of the index
>> access work is done in the cluster.
>
> Not sure if I quite get your point there. What do you mean by "not
> enough of the index access work is done in the cluster"? Do
> you mean that this architecture would be most suited for a read /
> query intensive scenario rather than a frequent update one?

Using V (Project Voldemort), the cluster nodes are serving up disk
blocks.  All the B+Tree traversal is done in the query processor, so
there is less utilization of the cluster node CPUs.  V is acting as a disk -
a fast, large, fault-tolerant one, but still a disk.

If the cluster nodes are themselves (partial) indexes, you can ask "get
all triples matching (S,P,?)" and the index work happens on the cluster
nodes.  That also enables parallelism. Filtering can be done before network
transfer as well - e.g. ask for "(S,P,?) where ? is more than 25".
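
To make that concrete, a hypothetical interface for such a "smart" cluster 
node might look like the sketch below. Illustration only - it is not part of 
TDB or TDB-V; the Triple/Node types are Jena's, everything else is made up:

    import java.util.Iterator;
    import com.hp.hpl.jena.graph.Node;
    import com.hp.hpl.jena.graph.Triple;

    // A node that owns part of an index and evaluates pattern + filter
    // locally, so only matching triples cross the network.
    interface TripleFilter {
        boolean accept(Triple t);          // e.g. "object is a number > 25"
    }

    interface IndexNode {
        // null components are wildcards, so find(s, p, null) is the (S,P,?) case.
        Iterator<Triple> find(Node s, Node p, Node o, TripleFilter filter);
    }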

> Your approach seems to be similar to a recently published approach,
> which is the only research paper I have found in this area:
> http://www.edbt.org/Proceedings/2012-Berlin/papers/workshops/danac2012/a4-bugiotti.pdf.

Thanks for that link.

> Is there a chance that I can take a look at your project code?

If you really want to :-) ....

https://github.com/afs/TDB-V

>
> I know http://www.dydra.com/ but they haven't published anything yet
> on how they manage their store. And the testing you can do seems to
> be rather limited due to the beta constraints they currently have.

I have experimented with a proper cluster engine but (1) the internal 
design was wrong [it could lock up - design flaw] and (2) I'm not sure 
what the legal situation is for it due to a change of employer.

--

Another thought:

Several of the query languages over HDFS use an SQL-like syntax to 
access the abilities of the underlying storage.  It's not SQL - it's a 
restricted SQL-ish thing that exposes capabilities in a familiar way.
MySQL started the same way - no joins, a restricted SQL that accessed
one table with value restrictions.

That could be applied to SPARQL - have a subset of SPARQL over Amazon
Dynamo / Cassandra / other noSQL store of your choice.

(hmm - I wonder if there is any Sindice-related work that analyses a
SPARQL query to see if it can be mapped to a Sindice index because
Sindice can do a subset of graph patterns.)
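
A cheap way to detect such a subset with ARQ is to look at the query's 
algebra: if it compiles down to a single basic graph pattern (possibly under 
a projection), it is a candidate for pushing straight to the restricted 
store. A rough, untested sketch, using Jena 2.x package names:

    import com.hp.hpl.jena.query.Query;
    import com.hp.hpl.jena.query.QueryFactory;
    import com.hp.hpl.jena.sparql.algebra.Algebra;
    import com.hp.hpl.jena.sparql.algebra.Op;
    import com.hp.hpl.jena.sparql.algebra.op.OpBGP;
    import com.hp.hpl.jena.sparql.algebra.op.OpProject;
    import com.hp.hpl.jena.sparql.core.BasicPattern;

    public class SubsetCheck {
        // Returns the BGP if the query is just a (possibly projected) basic
        // graph pattern, otherwise null ("fall back to a full SPARQL engine").
        public static BasicPattern asSingleBGP(String queryString) {
            Query query = QueryFactory.create(queryString);
            Op op = Algebra.compile(query);
            if (op instanceof OpProject)
                op = ((OpProject) op).getSubOp();   // strip the SELECT projection
            if (op instanceof OpBGP)
                return ((OpBGP) op).getPattern();
            return null;
        }
    }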

	Andy

>
>
> On Tue, Apr 24, 2012 at 8:56 PM, Andy Seaborne<an...@apache.org>  wrote:
>> On 24/04/12 14:57, Paolo Castagna wrote:
>>>
>>> Tobias Neef wrote:
>>>>
>>>> Hi Paolo,
>>>>
>>>> thanks for the quick response! The reason for doing this is that I
>>>> think it would be useful to have an RDF database with a SPARQL interface
>>>> which can be used as a PaaS offering like Amazon RDS or Amazon Dynamo
>>>> DB: for the developer this would mean no hassle about replication or
>>>> scaling, etc. To some extent you can achieve that when using Jena SDB
>>>> on top of something like Amazon RDS or MS SQL Azure. I want to try how
>>>> far I can get when I use Jena as API and map it to something like
>>>> Dynamo DB or MS Azure Tables which have quite unique
>>>> Scalability/Availability characteristics. There is for example
>>>> http://datomic.com/ which also goes along those lines. They
>>>> implemented it on top of Dynamo DB but with a custom query language.
>>>>
>>>> Does that make sense from your perspective?
>>
>>
>> Hi Tobias,
>>
>> Interesting space and it would be great to have such a service.
>>
>> There are quite a few design choices to make and they can greatly influence
>> the design.  For example: a service that offered replication etc and had
>> many datasets can be built using one dataset per machine as the unit.  It
>> scales in total data but not in data-per-dataset or graph.
>>
>> A service that specialised in massive data (more about data management than
>> raw query performance; maybe like a column store if aggregation queries
>> matter) is different from one giving as-near-real-time response for UIs
>> (basically, in-memory or the working set is in-memory).
>>
>> In terms of where to start,
>>
>> SDB if you are building on top of an SQL service
>>
>> TDB, or the shell of TDB, if you are building on what amounts to an index
>> service.  TDB is built on top of indexes - you can plug in your own.
>>
>> I have built a TDB that used Project Voldemort as a block store for the TDB
>> B+Trees.  It worked quite well but as a highly scalable base, it's limited as
>> too much work ends up on the query engine and not enough of the index
>> access work is done in the cluster.
>>
>> As for examples: see http://www.dydra.com/ which is SPARQL.
>>
>>         Andy
>>


Re: How to implement a custom JENA Backend

Posted by Milorad Tosic <mb...@yahoo.com>.
Hi,

This is an interesting topic, indeed. I haven't done any research on scalable RDF stores but I accidentally ran across http://www.systap.com/bigdata.htm which claims something like that.

My two cents ...

Milorad




>________________________________
> From: Tobias Neef <to...@gmail.com>
>To: jena-users@incubator.apache.org 
>Sent: Wednesday, April 25, 2012 10:33 AM
>Subject: Re: How to implement a custom JENA Backend
> 
>Hi Andy,
>
>thanks a lot for your insight. I agree with you that there is no free
>cake. Most of the highly scalable stores have a very specific usage
>scenario which is their sweet spot. Also the qualities of such a
>service would depend on the mapping strategy you choose.
>
>> I have built a TDB that used Project Voldemort as a block store for the TDB
>> B+Trees.  It worked quite well but as a highly scalable base, it's limited as
>> too much work ends up on the query engine and not enough of the index
>> access work is done in the cluster.
>
>Not sure if I quite get your point there. What do you mean by "not
>enough of the index access work is done in the cluster"? Do
>you mean that this architecture would be most suited for a read /
>query intensive scenario rather than a frequent update one?
>
>Your approach seems to be similar to a recently published approach,
>which is the only research paper I have found in this area:
>http://www.edbt.org/Proceedings/2012-Berlin/papers/workshops/danac2012/a4-bugiotti.pdf.
>
>Is there a chance that I can take a look at your project code?
>
>I know http://www.dydra.com/ but they haven't published anything yet
>on how they manage their store. And the testing you can do seems to
>be rather limited due to the beta constraints they currently have.
>
>
>On Tue, Apr 24, 2012 at 8:56 PM, Andy Seaborne <an...@apache.org> wrote:
>> On 24/04/12 14:57, Paolo Castagna wrote:
>>>
>>> Tobias Neef wrote:
>>>>
>>>> Hi Paolo,
>>>>
>>>> thanks for the quick response! The reason for doing this is that I
>>>> think it would be useful to have an RDF database with a SPARQL interface
>>>> which can be used as a PaaS offering like Amazon RDS or Amazon Dynamo
>>>> DB: for the developer this would mean no hassle about replication or
>>>> scaling, etc. To some extent you can achieve that when using Jena SDB
>>>> on top of something like Amazon RDS or MS SQL Azure. I want to try how
>>>> far I can get when I use Jena as API and map it to something like
>>>> Dynamo DB or MS Azure Tables which have quite unique
>>>> Scalability/Availability characteristics. There is for example
>>>> http://datomic.com/ which also goes along those lines. They
>>>> implemented it on top of Dynamo DB but with a custom query language.
>>>>
>>>> Does that make sense from your perspective?
>>
>>
>> Hi Tobias,
>>
>> Interesting space and it would be great to have such a service.
>>
>> There are quite a few design choices to make and they can greatly influence
>> the design.  For example: a service that offered replication etc and had
>> many datasets can be built using one dataset per machine as the unit.  It
>> scales in total data but not in data-per-dataset or graph.
>>
>> A service that specialised in massive data (more about data management than
>> raw query performance; maybe like a column store if aggregation queries
>> matter) is different from one giving as-near-real-time response for UIs
>> (basically, in-memory or the working set is in-memory).
>>
>> In terms of where to start,
>>
>> SDB if you are building on top of an SQL service
>>
>> TDB, or the shell of TDB, if you are building on what amounts to an index
>> service.  TDB is built on top of indexes - you can plug in your own.
>>
>> I have built a TDB that used Project Voldemort as a block store for the TDB
>> B+Trees.  It worked quite well but as a highly scalable base, it's limited as
>> too much work ends up on the query engine and not enough of the index
>> access work is done in the cluster.
>>
>> As for examples: see http://www.dydra.com/ which is SPARQL.
>>
>>        Andy
>>
>
>
>

Re: How to implement a custom JENA Backend

Posted by Tobias Neef <to...@gmail.com>.
Hi Andy,

thanks a lot for your insight. I agree with you that there is no free
cake. Most of the highly scalable stores have a very specific usage
scenario which is their sweet spot. Also the qualities of such a
service would depend on the mapping strategy you choose.

> I have built a TDB that used Project Voldemort as a block store for the TDB
> B+Trees.  It worked quite well but as a highly scalable base, it's limited as
> too much work ends up on the query engine and not enough of the index
> access work is done in the cluster.

Not sure if I quite get your point there. What do you mean by "not
enough of the index access work is done in the cluster"? Do
you mean that this architecture would be most suited for a read /
query intensive scenario rather than a frequent update one?

Your approach seems to be similar to a recently published approach,
which is the only research paper I have found in this area:
http://www.edbt.org/Proceedings/2012-Berlin/papers/workshops/danac2012/a4-bugiotti.pdf.

Is there a chance that I can take a look at your project code?

I know http://www.dydra.com/ but they haven't published anything yet
on how they manage their store. And the testing you can do seems to
be rather limited due to the beta constraints they currently have.


On Tue, Apr 24, 2012 at 8:56 PM, Andy Seaborne <an...@apache.org> wrote:
> On 24/04/12 14:57, Paolo Castagna wrote:
>>
>> Tobias Neef wrote:
>>>
>>> Hi Paolo,
>>>
>>> thanks for the quick response! The reason for doing this is that I
>>> think it would be useful to have an RDF database with a SPARQL interface
>>> which can be used as a PaaS offering like Amazon RDS or Amazon Dynamo
>>> DB: for the developer this would mean no hassle about replication or
>>> scaling, etc. To some extent you can achieve that when using Jena SDB
>>> on top of something like Amazon RDS or MS SQL Azure. I want to try how
>>> far I can get when I use Jena as API and map it to something like
>>> Dynamo DB or MS Azure Tables which have quite unique
>>> Scalability/Availability characteristics. There is for example
>>> http://datomic.com/ which also goes along those lines. They
>>> implemented it on top of Dynamo DB but with a custom query language.
>>>
>>> Does that make sense from your perspective?
>
>
> Hi Tobias,
>
> Interesting space and it would be great to have such a service.
>
> There are quite a few design choices to make and they can greatly influence
> the design.  For example: a service that offered replication etc and had
> many datasets can be built using one dataset per machine as the unit.  It
> scales in total data but not in data-per-dataset or graph.
>
> A service that specialised in massive data (more about data management than
> raw query performance; maybe like a column store if aggregation queries
> matter) is different from one giving as-near-real-time response for UIs
> (basically, in-memory or the working set is in-memory).
>
> In terms of where to start,
>
> SDB if you are building on top of an SQL service
>
> TDB, or the shell of TDB, if you are building on what amounts to an index
> service.  TDB is built on top of indexes - you can plug in your own.
>
> I have built a TDB that used Project Voldemort as a block store for the TDB
> B+Trees.  It worked quite well but as a highly scalable base, it's limited as
> too much work ends up on the query engine and not enough of the index
> access work is done in the cluster.
>
> As for examples: see http://www.dydra.com/ which is SPARQL.
>
>        Andy
>

Re: How to implement a custom JENA Backend

Posted by Andy Seaborne <an...@apache.org>.
On 24/04/12 14:57, Paolo Castagna wrote:
> Tobias Neef wrote:
>> Hi Paolo,
>>
>> thanks for the quick response! The reason for doing this is that I
>> think it would be useful to have an RDF database with a SPARQL interface
>> which can be used as a PaaS offering like Amazon RDS or Amazon Dynamo
>> DB: for the developer this would mean no hassle about replication or
>> scaling, etc. To some extent you can achieve that when using Jena SDB
>> on top of something like Amazon RDS or MS SQL Azure. I want to try how
>> far I can get when I use Jena as API and map it to something like
>> Dynamo DB or MS Azure Tables which have quite unique
>> Scalability/Availability characteristics. There is for example
>> http://datomic.com/ which also goes along those lines. They
>> implemented it on top of Dynamo DB but with a custom query language.
>>
>> Does that make sense from your perspective?

Hi Tobias,

Interesting space and it would be great to have such a service.

There are quite a few design choices to make and they can greatly 
influence the design.  For example: a service that offered replication
etc and had many datasets can be built using one dataset per machine as 
the unit.  It scales in total data but not in data-per-dataset or graph.

A service that specialised in massive data (more about data management 
than raw query performance; maybe like a column store if aggregation 
queries matter) is different from one giving as-near-real-time response
for UIs (basically, in-memory or the working set is in-memory).

In terms of where to start,

SDB if you are building on top of an SQL service

TDB, or the shell of TDB, if you are building on what amounts to an index
service.  TDB is built on top of indexes - you can plug in your own.

I have built a TDB that used Project Voldemort as a block store for the 
TDB B+Trees.  It worked quite well but as a highly scalable base, it's
limited as too much work ends up on the query engine and not enough of
the index access work is done in the cluster.
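
Reduced to a sketch, the idea is that each B+Tree block becomes one value in 
the key-value store, keyed by its block id. (Hypothetical code, for 
illustration only - this is not TDB's actual block-manager interface; see the 
TDB and TDB-V sources for the real contract.)

    // Hypothetical key-value block store, illustration only.
    interface KeyValueStore {                    // e.g. wraps a Voldemort client
        byte[] get(String key);
        void put(String key, byte[] value);
    }

    class KeyValueBlockStore {
        private final KeyValueStore kv;
        private final String file;               // one key namespace per index file

        KeyValueBlockStore(KeyValueStore kv, String file) {
            this.kv = kv;
            this.file = file;
        }

        // Each B+Tree block is one value, keyed by (file, block id).
        byte[] readBlock(long id)               { return kv.get(file + "/" + id); }
        void writeBlock(long id, byte[] bytes)  { kv.put(file + "/" + id, bytes); }
    }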

As for examples: see http://www.dydra.com/ which is SPARQL.

	Andy


Re: How to implement a custom JENA Backend

Posted by Paolo Castagna <ca...@googlemail.com>.
Tobias Neef wrote:
> Hi Paolo,
> 
> thanks for the quick response! The reason for doing this is that I
> think it would be useful to have an RDF database with a SPARQL interface
> which can be used as a PaaS offering like Amazon RDS or Amazon Dynamo
> DB: for the developer this would mean no hassle about replication or
> scaling, etc. To some extent you can achieve that when using Jena SDB
> on top of something like Amazon RDS or MS SQL Azure. I want to try how
> far I can get when I use Jena as API and map it to something like
> Dynamo DB or MS Azure Tables which have quite unique
> Scalability/Availability characteristics. There is for example
> http://datomic.com/ which also goes along those lines. They
> implemented it on top of Dynamo DB but with a custom query language.
> 
> Does that make sense from your perspective?

Yep, interesting. Please do share your findings with us if you are successful
with that. ;-)

... and let us know which DB in the cloud you choose.

Heroku has Postgres in their add-ons and maybe someone could even make money with
a SPARQL endpoint add-on for Heroku (maybe, or maybe not).
Typically you need a lot of RAM and RAM is overpriced in the cloud(s)...

Paolo

> 
> On Tue, Apr 24, 2012 at 3:32 PM, Paolo Castagna
> <ca...@googlemail.com> wrote:
>> Hi Tobias,
>> I do not have general and useful hints other than: look at the SDB source code
>> if you are planning to develop a custom "backend" on a relational database or on
>> a storage system which offers you an SQL-like query language, and look at the TDB
>> source code if you are planning to experiment with your own indexes on disk.
>>
>> I have a question for you: what is the reason you are thinking of developing a
>> custom "Backend" for Jena?
>>
>> Thanks,
>> Paolo
>>
>> Tobias Neef wrote:
>>> Hi,
>>>
>>> I am currently thinking about developing a custom "Backend" for Jena
>>> just like the relational one and the native TDB. It would be great if
>>> you could give me some general hints on how the process of developing
>>> such a backend would look.
>>>
>>> Best regards,
>>> Tobias Neef


Re: How to implement a custom JENA Backend

Posted by Tobias Neef <to...@gmail.com>.
Hi Paolo,

thanks for the quick response! The reason for doing this is that I
think it would be useful to have an RDF database with a SPARQL interface
which can be used as a PaaS offering like Amazon RDS or Amazon Dynamo
DB: for the developer this would mean no hassle about replication or
scaling, etc. To some extent you can achieve that when using Jena SDB
on top of something like Amazon RDS or MS SQL Azure. I want to try how
far I can get when I use Jena as API and map it to something like
Dynamo DB or MS Azure Tables which have quite unique
Scalability/Availability characteristics. There is for example
http://datomic.com/ which also goes along those lines. They
implemented it on top of Dynamo DB but with a custom query language.

Does that make sense from your perspective?

On Tue, Apr 24, 2012 at 3:32 PM, Paolo Castagna
<ca...@googlemail.com> wrote:
> Hi Tobias,
> I do not have general and useful hints other than: look at the SDB source code
> if you are planning to develop a custom "backend" on a relational database or on
> a storage system which offers you an SQL-like query language, and look at the TDB
> source code if you are planning to experiment with your own indexes on disk.
>
> I have a question for you: what is the reason you are thinking of developing a
> custom "Backend" for Jena?
>
> Thanks,
> Paolo
>
> Tobias Neef wrote:
>> Hi,
>>
>> I am currently thinking about developing a custom "Backend" for Jena
>> just like the relational one and the native TDB. It would be great if
>> you could give me some general hints on how the process of developing
>> such a backend would look.
>>
>> Best regards,
>> Tobias Neef
>

Re: How to implement a custom JENA Backend

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Tobias,
I do not have general and useful hints other than: look at the SDB source code
if you are planning to develop a custom "backend" on a relational database or on
a storage system which offers you an SQL-like query language, and look at the TDB
source code if you are planning to experiment with your own indexes on disk.
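
For a very first prototype there is also a third option: implement the Graph 
SPI directly by subclassing GraphBase, map find/add/delete onto your store, 
and let ARQ do the SPARQL work on top. A rough, untested sketch using Jena 
2.x package names (the exact graphBaseFind signature differs between Jena 
versions, so check the GraphBase javadoc for yours); MyTripleStore is a 
made-up client for whatever backend you pick:

    import java.util.Iterator;
    import com.hp.hpl.jena.graph.Node;
    import com.hp.hpl.jena.graph.Triple;
    import com.hp.hpl.jena.graph.TripleMatch;
    import com.hp.hpl.jena.graph.impl.GraphBase;
    import com.hp.hpl.jena.util.iterator.ExtendedIterator;
    import com.hp.hpl.jena.util.iterator.WrappedIterator;

    // Hypothetical storage client - whatever wraps DynamoDB, Azure Tables, etc.
    interface MyTripleStore {
        Iterator<Triple> find(Node s, Node p, Node o);   // null = wildcard
        void add(Triple t);
        void delete(Triple t);
    }

    public class MyStoreGraph extends GraphBase {
        private final MyTripleStore store;

        public MyStoreGraph(MyTripleStore store) { this.store = store; }

        @Override
        protected ExtendedIterator<Triple> graphBaseFind(TripleMatch m) {
            // Unbound slots come through as wildcards; push them down to the store.
            return WrappedIterator.create(
                store.find(m.getMatchSubject(), m.getMatchPredicate(), m.getMatchObject()));
        }

        @Override
        public void performAdd(Triple t)    { store.add(t); }

        @Override
        public void performDelete(Triple t) { store.delete(t); }
    }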

I have a question for you: what is the reason you are thinking of developing a
custom "Backend" for Jena?

Thanks,
Paolo

Tobias Neef wrote:
> Hi,
> 
> I am currently thinking about developing a custom "Backend" for Jena
> just like the relational one and the native TDB. It would be great if
> you could give me some general hints on how the process of developing
> such a backend would look.
> 
> Best regards,
> Tobias Neef


Re: How to implement a custom JENA Backend

Posted by Paolo Castagna <ca...@googlemail.com>.
By the way, indeed, this could be a good advanced "How to" document to add to
the already existing Apache Jena documentation. ;-)

Making it easy for developers to plug in different storage backends, different
parsers/serializers, different custom indexes (a la LARQ, GeoSPARQL, etc.) is,
IMHO, a valuable thing for a project such as Jena.

Paolo

Tobias Neef wrote:
> Hi,
> 
> I am currently thinking about developing a custom "Backend" for Jena
> just like the relational one and the native TDB. It would be great if
> you could give me some general hints on how the process of developing
> such a backend would look.
> 
> Best regards,
> Tobias Neef