Posted to users@jena.apache.org by Mirko Kämpf <mi...@gmail.com> on 2018/05/26 09:47:58 UTC
Fwd: Multiple Fuseki Servers in Distributed Environment
Hello Fuseki experts,
I would like to ask for your experience with, and thoughts on, the
following approach:
To enable semantic queries over "transient data", or over data that is
persisted in HDFS / HBase, I run a Fuseki server (standalone or embedded)
on each cluster node that hosts a Spark executor.
Since the data is partitioned I will not have references between the
datasets (in this particular case).
A simple query broker distributes the query and consolidates the
results. The next step would be to add a coordinator that uses graph
statistics to optimize dataset dumps and reloading in case of failure.
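The broker step described above could be sketched roughly as follows. This is only a minimal illustration and not the prototype's actual code: it assumes every node exposes a SPARQL 1.1 Protocol query endpoint that returns SPARQL JSON results, and the function names and endpoint URLs are hypothetical.

```python
import json
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_bindings(endpoint, query, timeout=30):
    """POST a SPARQL query to one endpoint and return its result bindings.

    Assumes the endpoint speaks the SPARQL 1.1 Protocol and can return
    results as application/sparql-results+json.
    """
    request = urllib.request.Request(
        endpoint,
        data=query.encode("utf-8"),  # supplying a body makes this a POST
        headers={
            "Content-Type": "application/sparql-query",
            "Accept": "application/sparql-results+json",
        },
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        payload = json.load(response)
    return payload["results"]["bindings"]


def broadcast_select(endpoints, query, fetch=fetch_bindings):
    """Send the same SELECT query to every endpoint and concatenate the rows.

    `fetch` is injectable so the merge logic can be tried without a network.
    Rows are returned in endpoint order.
    """
    if not endpoints:
        return []
    with ThreadPoolExecutor(max_workers=len(endpoints)) as pool:
        per_node = list(pool.map(lambda ep: fetch(ep, query), endpoints))
    return [row for rows in per_node for row in rows]


if __name__ == "__main__":
    # Hypothetical node addresses; replace with the real per-node endpoints.
    nodes = [
        "http://node1:3030/ds/query",
        "http://node2:3030/ds/query",
    ]
    rows = broadcast_select(nodes, "SELECT ?s WHERE { ?s ?p ?o } LIMIT 10")
    print(len(rows), "rows")
```

Plain concatenation is only adequate here because the partitions hold no references to each other. Queries that use DISTINCT, ORDER BY, LIMIT or aggregates would need an extra merge step at the broker.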
A load balancer balances request and result flows towards clients.
Eventually, the query broker will run in Docker.
A sketch is available here:
https://raw.githubusercontent.com/kamir/fuseki-cloud/master/Fuseki%20Cloud.png
My initial prototype works well. Now I want to go deeper, but I wonder
whether such an activity has already been started, or whether you know of
reasons why this is not a good approach.
In any case, if there is no reason not to implement such a
"Fuseki-Cloud" approach, I will continue on that route, and I would like
to contribute the results to the existing project.
Thanks for any hint or recommendation.
Best wishes,
Mirko
Re: Multiple Fuseki Servers in Distributed Environment
Posted by Dick Murray <da...@gmail.com>.
Apologies for resurrecting this thread...
Yes, it uses Thrift when distributed, i.e. multi-JVM.
It was on hold because I changed jobs, yay!
I'm starting to look at making it available as a Jena sidecar, i.e.
jena-mosaic.
DickM
On 27 May 2018 at 12:02, ajs6f <aj...@apache.org> wrote:
> There are several systems that distribute SPARQL using Jena.
>
> Dick Murray has written a system called Mosaic that (I believe) uses
> Apache Thrift to distribute the lower-level (DatasetGraph) primitives that
> ARQ uses to execute SPARQL. An advantage over your plan might be that he
> isn't serializing full results over HTTP to pass them around. I don't
> understand that system to be ready for use outside of Dick's deployment,
> but he could say more.
>
> The SANSA project [1] has provided a system that I understand to use ARQ
> to execute queries over Apache Spark or Apache Flink. This sounds similar
> in some ways to what you are doing, and that system is available today. I
> think Jena committer Lorenz Bühmann is involved with that project; if I am
> correct, he may be able to say more.
>
> There are doubtless others about which I don't know.
>
> ajs6f
>
> [1] http://sansa-stack.net/
Re: Multiple Fuseki Servers in Distributed Environment
Posted by ajs6f <aj...@apache.org>.
There are several systems that distribute SPARQL using Jena.
Dick Murray has written a system called Mosaic that (I believe) uses Apache Thrift to distribute the lower-level (DatasetGraph) primitives that ARQ uses to execute SPARQL. An advantage over your plan might be that he isn't serializing full results over HTTP to pass them around. I don't understand that system to be ready for use outside of Dick's deployment, but he could say more.
The SANSA project [1] has provided a system that I understand to use ARQ to execute queries over Apache Spark or Apache Flink. This sounds similar in some ways to what you are doing, and that system is available today. I think Jena committer Lorenz Bühmann is involved with that project; if I am correct, he may be able to say more.
There are doubtless others about which I don't know.
ajs6f
[1] http://sansa-stack.net/