Posted to users@jena.apache.org by Paolo Castagna <ca...@googlemail.com> on 2012/05/18 18:35:59 UTC

A fault-tolerant and replicated data publishing solution (by Epimorphics)... and how to calculate the triples to add/remove?

Hi,
I've just read this blog post from Andy:
http://www.epimorphics.com/web/wiki/epimorphics-builds-data-publish-platform-environment-agency

It describes a "quite simple" fault-tolerant and replicated data publishing solution using Apache Jena and Fuseki. Interesting.

It's a master/slave architecture. The master (called the 'controller server' in Andy's post) receives all updates and "calculates the triples to be added, the triples to be removed" so that changes
are 'idempotent' (i.e. they can be reapplied multiple times, in the same order, with the same effect).
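
To illustrate what I mean by 'idempotent', here is a tiny sketch using the Jena API (the update script is invented, just for illustration): a ground
DELETE DATA / INSERT DATA script leaves the store in the same state no matter how many times it is applied.

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import com.hp.hpl.jena.update.UpdateAction;

    public class IdempotentUpdate {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            // Ground triples only: no variables, no WHERE clause.
            String script =
                "PREFIX ex: <http://example.org/>\n" +
                "DELETE DATA { ex:doc ex:status ex:draft } ;\n" +
                "INSERT DATA { ex:doc ex:status ex:published }";
            UpdateAction.parseExecute(script, model);
            Model snapshot = ModelFactory.createDefaultModel().add(model);
            // Applying the same script again changes nothing:
            // DELETE DATA of an absent triple and INSERT DATA of an
            // already-present triple are both no-ops.
            UpdateAction.parseExecute(script, model);
            System.out.println(model.isIsomorphicWith(snapshot)); // true
        }
    }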

It would be interesting to know if the 'controller server' exposes a full SPARQL Update endpoint and/or the Graph Store HTTP Protocol and, if that is the case, how the triples to be added/removed
are calculated. (This is something I have wanted to learn about for a while, but I still have not found the time... a small example would be wonderful! ;-)).

To conclude, I fully agree with the "quite simple design" and with "simple systems are easier to operate". The approach described can work well in many scenarios where the rate of updates/writes
isn't excessive and reads dominate (which I still believe to be the case most of the time with RDF data, since the data is often human generated/curated).
My hope is to see something similar in the 'open' so that other Apache Jena and Fuseki users can benefit from a highly available and open source publishing solution for RDF data (and can focus
their energies/efforts elsewhere: on the quality of their data modeling, data, applications, user experience, etc.).

Paolo

PS:
Disclaimer: I don't work for Epimorphics; these are just my personal opinions and, last but not least, I love simplicity.

Re: A fault-tolerant and replicated data publishing solution (by Epimorphics)... and how to calculate the triples to add/remove?

Posted by Andy Seaborne <an...@apache.org>.
On 18/05/12 17:35, Paolo Castagna wrote:
> [...]
>
> It would be interesting to know if the 'controller server' exposes a
> full SPARQL Update endpoint and/or the Graph Store HTTP Protocol and,
> if that is the case, how the triples to be added/removed are
> calculated. (This is something I have wanted to learn about for a
> while, but I still have not found the time... a small example would
> be wonderful! ;-)).
>
> [...]

The controller is not the same as the replicas: it does not hold a copy 
of the DB that gets updated and propagated. It takes the changes (the 
data updates arrive as CSV), converts them to RDF and calculates the 
adds and deletes.  This process produces DELETE DATA ... INSERT DATA ... 
scripts, and the script to reset the current view.  It's not general 
SPARQL Update, and not master/slave as such.

Think of it as a design pattern.

A different pattern would also be possible for more general updates, 
assuming fail-stop nodes and no partitions, and briefly holding up 
updates at the point where a new server is being introduced.  Keeping a 
"last transaction" id would help, but I'm not sure it's necessary.  
This is useful for a few machines, but the restrictions become a burden 
as the number goes up, at which point a more complicated design would 
be more useful.  Again, this is for a system that is mainly publishing, 
with some updates, not a system with a high number and proportion of 
updates, and where absolute consistency isn't needed.
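
Replaying one of those idempotent scripts on each replica might look 
like this (endpoint URLs are invented, and this assumes ARQ's remote 
update support):

    import com.hp.hpl.jena.update.UpdateExecutionFactory;
    import com.hp.hpl.jena.update.UpdateFactory;
    import com.hp.hpl.jena.update.UpdateRequest;

    public class ReplayToReplicas {
        // Invented endpoint URLs, for the sake of the example.
        static final String[] REPLICAS = {
            "http://replica1:3030/ds/update",
            "http://replica2:3030/ds/update"
        };

        /** Apply one script to every replica, in order.  Idempotency
            means a recovering node can simply be replayed from its
            last recorded transaction onwards. */
        static void replay(String script) {
            UpdateRequest request = UpdateFactory.create(script);
            for (String endpoint : REPLICAS) {
                UpdateExecutionFactory.createRemote(request, endpoint).execute();
            }
        }
    }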

Horses for courses.

	Andy