
Posted to users@jena.apache.org by Jonathan MERCIER <jo...@microbiome.studio> on 2023/01/02 12:25:33 UTC

How to deploy a scalable SPARQL Jena service ?

Dear Jena developers and community,

I am looking to deploy an efficient SPARQL server with Jena instead of 
the free version of GraphDB.
After looking at the documentation (https://jena.apache.org), 
I would like to know:
1) Can we use the Jena application stack with Quarkus 
(<https://quarkus.io/about/>) so that it runs efficiently on a 
Kubernetes cluster?

2) Can we deploy a distributed TDB service, in order to have efficient 
queries?
The W3C wiki (<https://www.w3.org/wiki/LargeTripleStores>) 
describes GraphDB as at least 5 times faster than Jena.
So can we speed things up by providing a good environment, such as 
Quarkus, Elasticsearch, a clustered DB, ...?

Thanks

Best regards

Re: How to deploy a scalable SPARQL Jena service ?

Posted by Jonathan MERCIER <jo...@microbiome.studio>.


>> Hi Andy,
>> 
>>> Could you say something about the usage patterns you are interested 
>>> in supporting? Size of data? Query load?
> 
> Shiro will do the authentication and API security for authorization.
> 
> To get the access control on parts of the overall data, do you split 
> the data into separate triplestores? Do you use the per-graph access 
> control of Jena to get data level security?
> 
> The per-graph access control works if (1) you can manage the data 
> that way with named graphs and (2) the access control is user, or 
> role, based.

I think we will use both datasets and named graphs to control data access.
My main problems here are:
1. The Apache Shiro documentation for Jena is written more for 
developers than for users.
2. How to combine Keycloak (our global IAM) with Shiro, as we have 
multiple internal services and multiple external organizations. We use 
Keycloak and AD/LDAP groups to manage their roles.

Re: How to deploy a scalable SPARQL Jena service ?

Posted by Nicholas Car <ni...@kurrawong.net>.

Hi Luis,

Why don't you drop me a line on the email below and I'll respond directly?

Cheers, Nick


------- Original Message -------
On Monday, January 9th, 2023 at 16:29, Luis Enrique Ramos García <lu...@googlemail.com.INVALID> wrote:


> Dear Nicholas,
> 
> I would be interested in getting more information about your approach.
> Should I contact you directly, or could you share the information here
> on the list?
> 
> Best regards
> 
> Luis Ramos
> 
> On Mon, 9 Jan 2023 at 6:51, Nicholas Car (nick@kurrawong.net) wrote:
> 
> > In case readers of this thread, or the list generally, are interested, we
> > are testing out a virtual graph access control system that works nicely
> > with Jena/Fuseki. We create Virtual Graphs that are Named Graphs with no
> > content but are closures of other Named Graphs that do hold content. In
> > this way, we can implement fancy access control - multiple users, groups
> > and roles - to small graph parts, using just standard quad store elements +
> > administration data holdings.
> > 
> > So here you would break the larger graph into a Named Graph per governance
> > unit - whatever your smallest conception of that is - and then build back
> > up access to multiple Named Graphs via Virtual Graphs. All done in Fuseki
> > back-end + access control API.
> > 
> > Happy to share more details if anyone is interested, here or directly.
> > 
> > Cheers, Nick
> > 
> > --
> > Dr Nicholas Car
> > Data Architect & Knowledge Graph Specialist
> > Kurrawong AI
> > nick@kurrawong.net
> > 0477 560 177
> > https://kurrawong.net
> > 
> > Honorary Lecturer
> > College of Engineering, Computing & Cybernetics
> > Australian National University
> > https://cecc.anu.edu.au/people/nicholas-car
> > 
Re: How to deploy a scalable SPARQL Jena service ?

Posted by Luis Enrique Ramos García <lu...@googlemail.com.INVALID>.

Dear Nicholas,

I would be interested in getting more information about your approach.
Should I contact you directly, or could you share the information here
on the list?


Best regards


Luis Ramos

Re: How to deploy a scalable SPARQL Jena service ?

Posted by Nicholas Car <ni...@kurrawong.net>.

In case readers of this thread, or the list generally, are interested, we are testing out a virtual graph access control system that works nicely with Jena/Fuseki. We create Virtual Graphs that are Named Graphs with no content but are closures of other Named Graphs that do hold content. In this way, we can implement fancy access control - multiple users, groups and roles - to small graph parts, using just standard quad store elements + administration data holdings.

So here you would break the larger graph into a Named Graph per governance unit - whatever your smallest conception of that is - and then build back up access to multiple Named Graphs via Virtual Graphs. All done in Fuseki back-end + access control API.
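
To make the idea concrete, here is a rough Python sketch of how an access control API in front of Fuseki might resolve roles to Virtual Graphs and then to the content-bearing Named Graphs. It is an illustration only and not the system described above: the graph IRIs, role names, the two lookup tables and the function names are all made up, and a real deployment would keep this administration data in the store rather than in code.

```python
from typing import Dict, Iterable, List, Set

# Hypothetical administration data: each Virtual Graph holds no triples itself
# and only lists the content-bearing Named Graphs it stands for.
VIRTUAL_GRAPHS: Dict[str, Set[str]] = {
    "urn:vg:public": {"urn:g:uniprot-core"},
    "urn:vg:tenant-a": {"urn:g:uniprot-core", "urn:g:tenant-a-private"},
}

# Hypothetical grants: which Virtual Graphs each role may read.
ROLE_GRANTS: Dict[str, Set[str]] = {
    "guest": {"urn:vg:public"},
    "tenant-a-member": {"urn:vg:tenant-a"},
}


def visible_graphs(roles: Iterable[str]) -> List[str]:
    """Expand roles -> Virtual Graphs -> Named Graphs, sorted and de-duplicated."""
    graphs: Set[str] = set()
    for role in roles:
        for virtual_graph in ROLE_GRANTS.get(role, set()):
            graphs |= VIRTUAL_GRAPHS.get(virtual_graph, set())
    return sorted(graphs)


def scoped_select(pattern: str, roles: Iterable[str]) -> str:
    """Build a SELECT whose default graph is the merge of the readable graphs."""
    graphs = visible_graphs(roles)
    if not graphs:
        raise PermissionError("no readable graphs for these roles")
    from_clauses = "\n".join(f"FROM <{g}>" for g in graphs)
    return f"SELECT *\n{from_clauses}\nWHERE {{ {pattern} }}"


if __name__ == "__main__":
    print(scoped_select("?s ?p ?o", ["tenant-a-member"]))
```

The point of the sketch is only that access is granted at the Virtual Graph level while the data stays partitioned by governance unit; the query rewriting with FROM clauses is one possible way to enforce it, chosen here for brevity.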

Happy to share more details if anyone is interested, here or directly.

Cheers, Nick

--
Dr Nicholas Car
Data Architect & Knowledge Graph Specialist
Kurrawong AI
nick@kurrawong.net
0477 560 177
https://kurrawong.net

Honorary Lecturer
College of Engineering, Computing & Cybernetics
Australian National University
https://cecc.anu.edu.au/people/nicholas-car

--

Re: How to deploy a scalable SPARQL Jena service ?

Posted by Andy Seaborne <an...@apache.org>.

On 06/01/2023 15:37, Jonathan MERCIER wrote:
>> Hi Jonathan,
> 
> Hi Andy,
> 
>> Could you say something about the usage patterns you are interested in 
>> supporting? Size of data? Query load?
> 
> Yes, of course. We aim to store part of the UniProt ontology in order to 
> study metabolism on multiple layers: Organism/Gene/Protein/Reaction/Pathway.
> Thus we will have a huge amount of public and private data (both academic 
> research and industrial).
> So we have to use Apache Shiro to control who can access which data (by 
> tenant).

Shiro will do the authentication and API security for authorization.

To get the access control on parts of the overall data, do you split the 
data into separate triplestores? Do you use the per-graph access control 
of Jena to get data level security?

The per-graph access control works if (1) you can manage the data that 
way with named graphs and (2) the access control is user, or role, based.

In my day job, I'm working on another data access control system - we have 
existing data which does not decompose into named graphs very easily and 
the access control rules don't fit a user/role basis (Role Based Access 
Control = RBAC).

Attribute Based Access Control (ABAC) can go down to labelling the 
access conditions on individual triples - and also provides for simple 
triple pattern matching (because sometimes, many triples have the same 
label e.g. they have the same property).

The "attribute" part comes from having key/value boolean expressions for 
access conditions, such as "department=engineering & status=employee" 
which can be moved around with the data when sharing across enterprise 
boundaries.
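
As a rough illustration of the label idea (a sketch only, not the system mentioned above), the following Python evaluates labels of the form "key=value & key=value" against a user's attributes and filters triples accordingly. The function names, the sample triples and the choice that an empty label means "visible to everyone" are assumptions made for the example; only conjunctions are handled.

```python
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]


def parse_label(label: str) -> Dict[str, str]:
    """Parse 'k1=v1 & k2=v2' into {'k1': 'v1', 'k2': 'v2'} (conjunction only)."""
    conditions: Dict[str, str] = {}
    for part in label.split("&"):
        part = part.strip()
        if not part:
            continue  # an empty label carries no conditions
        key, sep, value = part.partition("=")
        if not sep:
            raise ValueError(f"malformed condition: {part!r}")
        conditions[key.strip()] = value.strip()
    return conditions


def allowed(label: str, user_attrs: Dict[str, str]) -> bool:
    """True when the user holds every attribute value the label requires."""
    return all(user_attrs.get(k) == v for k, v in parse_label(label).items())


def visible_triples(
    labelled: List[Tuple[Triple, str]], user_attrs: Dict[str, str]
) -> List[Triple]:
    """Keep only the triples whose label the user satisfies."""
    return [triple for triple, label in labelled if allowed(label, user_attrs)]


if __name__ == "__main__":
    data = [
        (("ex:p1", "ex:salary", "100"), "department=engineering & status=employee"),
        (("ex:p1", "ex:name", "Ann"), ""),  # assumed: empty label = public
    ]
    print(visible_triples(data, {"department": "engineering", "status": "employee"}))
```

Because the label travels with the triple rather than with a graph, the same filtering can be applied wherever the data ends up, which is the property that matters when sharing across organisations.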

> 
> The size of the data is currently estimated at around 1 TB.
> We will publish a knowledge release from time to time, so most of the 
> time we will run read-only queries, and occasionally we will push a new 
> release (1 TB).

Then the full capabilities of RDF Delta may not be needed. It sounds like 
an offline database build, then copying the DB to multiple triple stores 
behind a load balancer, would be enough.

Full 24x7 update with no single point of failure is nice but it is 
complex. More servers (cost), more admin (more cost!).

Or, for a few non-time-critical incremental updates, a simple mode for 
RDF Delta is a single patch manager with a replicated filesystem.
This is a single point of failure for updates, but the Fuseki replicas 
can provide query service throughout. It is simpler to operate.

     Andy


Re: How to deploy a scalable SPARQL Jena service ?

Posted by Jonathan MERCIER <jo...@microbiome.studio>.

> Hi Jonathan,

Hi Andy,

> Could you say something about the usage patterns you are interested 
> in supporting? Size of data? Query load?

Yes, of course. We aim to store part of the UniProt ontology in order to 
study metabolism on multiple layers: 
Organism/Gene/Protein/Reaction/Pathway.
Thus we will have a huge amount of public and private data (both 
academic research and industrial).
So we have to use Apache Shiro to control who can access which data (by 
tenant).

The size of the data is currently estimated at around 1 TB.
We will publish a knowledge release from time to time, so most of the 
time we will run read-only queries, and occasionally we will push a new 
release (1 TB).

> There is a Lucene based text index.

Indeed, I see. I will take a look at how to enable Lucene with TDB.

We will also look at the Fuseki API so that we can use it from our 
Python application (and, more rarely, Kotlin).
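
For reference, a minimal sketch of what calling a Fuseki query endpoint from Python could look like, using only the standard library and the SPARQL 1.1 Protocol (query sent as a URL-encoded POST, results requested as SPARQL JSON). The endpoint URL, the dataset name "ds" and the helper names are assumptions for illustration; authentication is left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Any, Dict, List


def build_sparql_request(endpoint: str, query: str) -> urllib.request.Request:
    """Build a SPARQL Protocol POST request with a URL-encoded 'query' parameter."""
    body = urllib.parse.urlencode({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,
        data=body,
        headers={
            "Content-Type": "application/x-www-form-urlencoded",
            "Accept": "application/sparql-results+json",
        },
        method="POST",
    )


def run_select(endpoint: str, query: str, timeout: float = 30.0) -> List[Dict[str, Any]]:
    """Send the query and return the list of result bindings."""
    request = build_sparql_request(endpoint, query)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)["results"]["bindings"]


if __name__ == "__main__":
    # Hypothetical local Fuseki instance and dataset name.
    endpoint = "http://localhost:3030/ds/query"
    for row in run_select(endpoint, "SELECT * WHERE { ?s ?p ?o } LIMIT 5"):
        print(row)
```

Since this relies only on the standard protocol, the same request shape should work from Kotlin or any other HTTP client.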

We aim to perform some geospatial queries (maybe we will have to write a 
plugin) in order to have a dedicated algorithm to walk through our 
knowledge graph.

>> 2) Can we deploy a distributed TDB service, in order to have efficient 
>> queries?
> 
> It can scale sideways with multiple copies of the database kept 
> consistent across a cluster of replicas using a separate project 
> (it is not an Apache Foundation project) that provides high 
> availability and multiple query servers.
> 
> RDF Delta
> <https://afs.github.io/rdf-delta>

Thanks Andy, I will take a look.

Re: How to deploy a scalable SPARQL Jena service ?

Posted by Andy Seaborne <an...@apache.org>.

Hi Jonathan,

Could you say something about the usage patterns you are interested in 
supporting? Size of data? Query load?

On 02/01/2023 12:25, Jonathan MERCIER wrote:
> Dear Jena developers and community,
> 
> I am looking to deploy an efficient SPARQL server with Jena instead of 
> the free version of GraphDB.
> After looking at the documentation (https://jena.apache.org), 
> I would like to know:
> 1) Can we use the Jena application stack with Quarkus 
> (<https://quarkus.io/about/>) so that it runs efficiently on a 
> Kubernetes cluster?

The triplestore from the Jena project is called Fuseki.
There is a Lucene based text index.

There is a Dockerfile that can serve as a basis for customizing the 
container to deploy.

There isn't anything from the Jena project specific to Quarkus but maybe 
some user here can help. And Fuseki can have extension modules to add 
functionality around the core engine.

> 2) Can we deploy a distributed TDB service, in order to have efficient 
> queries?

It can scale sideways with multiple copies of the database kept 
consistent across a cluster of replicas using a separate project (it 
is not an Apache Foundation project) that provides high availability and 
multiple query servers:

RDF Delta
https://afs.github.io/rdf-delta

    Andy
