Posted to users@jena.apache.org by Mirko Kämpf <mi...@gmail.com> on 2018/05/26 09:47:58 UTC

Fwd: Multiple Fuseki Servers in Distributed Environment

Hello Fuseki experts,

I want to ask you for your experience / thoughts about the following
approach:



In order to enable semantic queries over "transient data" or over data
that is persisted in HDFS / HBase, I run a Fuseki Server (standalone or
embedded) on each cluster node that hosts a Spark Executor.
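
To make this concrete, here is a minimal sketch of such a per-node
server using the embedded Fuseki API (port and dataset name are just
placeholders, loading the node-local partition into the dataset is
omitted, and depending on the Jena version the class lives in
jena-fuseki-main or the older jena-fuseki-embedded):

import org.apache.jena.fuseki.main.FusekiServer;
import org.apache.jena.query.Dataset;
import org.apache.jena.query.DatasetFactory;

public class NodeLocalFuseki {
    public static void main(String[] args) {
        // In-memory dataset standing in for the node-local partition;
        // in the prototype it would be filled from the local Spark
        // partition (or from data persisted in HDFS / HBase).
        Dataset ds = DatasetFactory.createTxnMem();

        // One Fuseki instance per cluster node, exposing the local
        // partition as a SPARQL endpoint under /partition.
        FusekiServer server = FusekiServer.create()
                .port(3030)
                .add("/partition", ds)
                .build();
        server.start();
    }
}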

Since the data is partitioned, I will not have references between the
datasets (in this particular case).

A simple query broker distributes the query and consolidates the
results. The next step would be to add a coordinator with graph
statistics for optimizing dataset dumps and for reloading in case of
failure.
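
To illustrate the broker idea, here is a stripped-down sketch of the
scatter-gather step (the endpoint URLs are placeholders; error handling,
parallel execution and the coordinator are left out). Simply
concatenating the rows is only valid here because the partitions have no
references between each other, and only for plain SELECT queries without
aggregation or ORDER BY:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.jena.query.Query;
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.QueryFactory;
import org.apache.jena.query.QuerySolution;
import org.apache.jena.query.ResultSet;

public class SimpleQueryBroker {

    // Placeholder endpoints: one Fuseki instance per Spark executor node.
    static final List<String> ENDPOINTS = Arrays.asList(
            "http://node1:3030/partition/sparql",
            "http://node2:3030/partition/sparql",
            "http://node3:3030/partition/sparql");

    // Send the same SELECT query to every node and concatenate the rows.
    public static List<QuerySolution> scatterGather(String sparql) {
        Query query = QueryFactory.create(sparql);
        List<QuerySolution> merged = new ArrayList<>();
        for (String endpoint : ENDPOINTS) {
            try (QueryExecution qe =
                     QueryExecutionFactory.sparqlService(endpoint, query)) {
                ResultSet rs = qe.execSelect();
                while (rs.hasNext()) {
                    merged.add(rs.next());
                }
            }
        }
        return merged;
    }
}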

A load balancer is used to balance the request and result flows towards
the clients; eventually, the query broker will run in Docker.

A sketch is available here:
https://raw.githubusercontent.com/kamir/fuseki-cloud/master/Fuseki%20Cloud.png



My initial prototype works well, and now I want to go deeper. But I
wonder whether such an activity has already been started, or whether you
know of reasons why this is not a good approach.

In any case, if there is no reason against implementing such a
"Fuseki-Cloud" approach, I will continue on that route and contribute
the results to the existing project.

Thanks for any hints or recommendations.

Best wishes,
Mirko

Re: Multiple Fuseki Servers in Distributed Environment

Posted by Dick Murray <da...@gmail.com>.
Apologies for resurrecting this thread...

Yes, it uses Thrift when distributed, i.e. multi-JVM.

It was on hold because I changed jobs, yay!

I'm starting to look at making it available as a Jena sidecar, i.e.
jena-mosaic.

DickM


Re: Multiple Fuseki Servers in Distributed Environment

Posted by ajs6f <aj...@apache.org>.
There are several systems that distribute SPARQL using Jena.

Dick Murray has written a system called Mosaic that (I believe) uses Apache Thrift to distribute the lower-level (DatasetGraph) primitives that ARQ uses to execute SPARQL. An advantage over your plan might be that he isn't serializing full results over HTTP to pass them around. I don't understand that system to be ready for use outside of Dick's deployment, but he could say more.

The SANSA project [1] has provided a system that I understand to use ARQ to execute queries over Apache Spark or Apache Flink. This sounds similar in some ways to what you are doing, and that system is available today. I think Jena committer Lorenz Bühmann is involved with that project; if I am correct, he may be able to say more.

There are doubtless others about which I don't know.

ajs6f

[1] http://sansa-stack.net/
