
Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2022/07/28 09:43:56 UTC

About JENA-2339 - security related

JENA-2339
PR#1441
https://github.com/vtermanis/jena/blob/dynamic-graph-restriction-extension/MOVE_ME_DynamicACL_notes.md

tl;dr:

It is a different role for Fuseki.

Fuseki executes the security but the setup and control come from a trusted
external server on the request execution path.

It assumes certain deployment environments to be safe.

My feeling is that we should make Fuseki configurable enough so that a
downstream 3rd party can add their security solution that is suitable
for their environment. But we should not incorporate a particular
security solution that relies on the deployment environment.

----

I've asked for more information about the claim on a performance
motivator and some other background information.

The usage patterns are not yet clear. The data is described as "a one
graph per handful of subjects and their properties" and "100s of
graphs". What the queries are is unstated.

There is no characterisation of the queries being made. If we are
talking about overheads, the cases of a few big queries and many small
queries are different.

The scale looks small (less than a million triples -
approximating as 100 graphs * 1000 triples). That makes the point about
access to TDB hooks a bit redundant.

There are distinguished users. A request from one of these users
causes the set of visible graphs to be read from a comment at the start
of the query text in the request.

The use of large numbers of small named graphs to manage security
settings looks to me like triple-level security. I have already
mentioned work on "FMod_ABAC" (£job related) a while back (2/Jan/2022). It
is triple level attribute-based security.

Concern 1:

This bypasses Fuseki-provided security and puts the control function
outside the Fuseki server in a separate server that is not part of Jena.
It will only be secure if deployed in a constrained network environment.

This is not secure except when run in a certain way and, personally, I
don't want to have to deal with a CVE because of that. CVE handling is
time consuming.

I don't see why it is using jena-access (the named graph security
feature) except for the filtering on TDB. It is creating a dynamic
dataset for the query.

Concern 2: How does update fit into the picture? (GSP is not supported).

Concern 3: It looks like a specific solution for a specific scenario.
Will it get uptake by the wider Jena user community?

Concern 4: Is there long-term support and maintenance for the feature?
(e.g. 5y+)
How do we respond to users@ messages about it? Is it experimental code or
has it been used for real? Is the feature set stable?

Opinion: it is not unreasonable to provide support for this kind of
customization of Fuseki.

An extension can then provide whatever security is needed for the
situation and it is the Fuseki user/operator making the decisions about
what is acceptable security and what isn't.

Fuseki has ways to add custom processors and this seems the way to
provide an alternative way to make queries.

Putting it in the distribution codebase is a big step for the project.
At the very least, it needs to be mature and likely to be used.

Background: Currently jena-access is in Fuseki main. It is not optional
because it predates Fuseki modules.

Andy

Re: About JENA-2339 - security related

Posted by Martynas Jusevičius <ma...@atomgraph.com>.

On Mon, 8 Aug 2022 at 18.06, Vilnis Termanis
<Vi...@iotics.com.invalid> wrote:

> On Mon, 1 Aug 2022 at 12:29, Andy Seaborne <an...@apache.org> wrote:
> >
> >
> >
> > On 28/07/2022 20:50, Vilnis Termanis wrote:
> > > Hi Andy & Jena development community,
> > >
> > > (Answers inline - apologies if I repeat myself)
> > >
> > > FYI - Our aim is to enable end-users to make SPARQL queries whilst
> > > respecting visibility restrictions.
> > > I.e. users (indirectly) add sets of related triples to a dataset and
> > > they can choose who has visibility (beyond themselves) over these,
> > > either: Nobody, Everyone or a chosen set (which can be updated). Note
> > > that this restriction is not by a specific subject or predicate.
> > > (Although the sets of triples do have relationships - not all of them
> > > are known in advance.)
> >
> > Let's clarify terminology here.
> >
> > A "Jena user" is a person or organisation that is downloading Jena,
> > either as the formal release (source code) or convenience binaries (e.g.
> > jars from Maven Central). The "convenience binaries" is the more usual case.
> >
> > Not Iotics users. Systems built with Jena have their own users.
> > (The Apache License applies - including clause 7.)
> >
> > The responsibility is between the downstream system builder and their
> > users of product or service being "fit for purpose".
>
> Sorry about that - I should have been clearer with the terms.
>
> In the submission - there is only one entity - the "Fuseki user" (e.g.
> via BasicAuth) to which the dynamic mode applies. However, since this
> is intended to be used as part of an integration (by Jena users - to
> gate access to their own domain-specific end-users), the
> authentication bit I think is irrelevant. (E.g. a separate service
> endpoint could have the proposed functionality enabled and this is
> what the integration calls.)
>
> >
> > > using a "SELECT {} 1" query, and
> > > adding a certain set of graphs makes the queries on my laptop take:
> > > ~600 graphs ~115ms
> > > ~1500 graphs ~162ms
> > > ~3k graphs ~240ms
> > > ~6k graphs ~400ms
> >
> > That's an illustration of the current system but we don't know what is
> > the cause of the cost.
> >
> > What piece of the code is taking the time?
> > Maybe the right thing to do is make it faster.
>
> I haven't looked into this in great detail, but from my understanding
> the time taken is a combination of a) parsing the input of allowed
> graphs and b) generating a new SecurityContext (holding a hashmap of
> said graphs). If providing a set of allowed graphs in the proposed way
> is not a no-go, I'm happy to dig into where the cost is exactly.
>
> >
> > And in the general area - what are you using for authentication?
> >
>
> For us right now, we're only using fuseki:auth "basic" for the
> purposes of differentiating different access levels against Fuseki
> Data Access Control configuration (by mapping those to Fuseki users),
> e.g.:
> Fuseki user1 => allowed to see graphs A & B
> Fuseki user2 => allowed to see graphs B & C
> Fuseki user3 => has the proposed dynamic-access feature
> enabled (i.e. no access unless the pragma preamble exists in query
> with 1+ graphs defined)
>
> Said Fuseki users (=roles) are then chosen based on what the system
> needs to do (domain-specific).
>
> > There is some bearer auth support in the next release ... it does not
> > provide complete bearer auth because it can't cover all cases (e.g. JWT
> > validation). It is more of a framework template with which to build a
> > local solution.
>
> I'm showing my lack of JWT/Bearer auth knowledge - but is this a
> building block for what Martynas suggested, namely the token implies
> the user to which dynamic ACL applies and then access can be
> restricted e.g. via WACL/Solid? (Correct me if I'm wrong but is this
> still not a solution that involves ACL rules being stored in Jena or
> at least being accessible via SPARQL for a SERVICE call?)
>

LinkedDataHub identifies agents with URIs, which can be called WebIDs [1].
Currently it supports WebID-TLS and OIDC with JWT tokens as authentication
protocols. Authorization is checked using WAC as mentioned earlier.

We use 2 Fuseki endpoints for each webapp instance: “end-user” and “admin”.
The auth queries federate between them using SERVICE. Sandboxing them might
be a little tricky, but in general it has worked well and did not require
any new security features in Fuseki.

[1] https://www.w3.org/2005/Incubator/webid/spec/
[2] https://github.com/AtomGraph/LinkedDataHub/issues/107


> >
> > ----
> >
> > "FMod_ABAC" is not related to jena-permissions.
> >
> > "FMod_" means Fuseki Module.
> > https://jena.apache.org/documentation/fuseki2/fuseki-modules
> >     No forks.
> > ABAC = Attribute Based Access Control.
> >
> > Using attributes separates ACLs from direct naming users for access to
> > things. FMod_ABAC things are triples. Triples have "labels". Labels are
> > attribute expressions, including AND and OR operators.
> >
> >      "employee | contractor" -- must have the "employee" attribute
> >                                 or the "contractor" attribute.
> >
> >      "employee & dept=engineering" -- must have both "employee" and
> >                                      "dept=engineering" attributes.
> >
> > There is a division of responsibilities. The data is labelled - so the
> > data owner is responsible for the data attribute requirements. The
> > assignment of attributes to users is separate.
> >
> > > FYI - In our case this means that we have a "make SPARQL query" API
> > > call. When received, the applicable user (our domain) is known and, in
> > > the proposed PR, we can prepend the set of allowed graphs to the query
> > > (which have been looked up prior to query execution, externally). The
> > > end user has NO direct access to Fuseki itself.
> >
> > You have a solution presuming a protected network, or possibly a
> > container with in-container networking.
> >
> > That's my Concern 1. Security conditions outside Jena must be met.
> > Having that, even if not in use, is an issue.
> >
>
> Maybe I misunderstand, but is this not in the same boat as:
> a) Configuring a service which allows write access (but not gating who
> can reach said service)
> b) Configuring Fuseki access control in config and allowing 1+ graphs
> (which shouldn't be included)
> c) Configuring a service which allows read access to all graphs (i.e.
> without Fuseki Graph ACL - again unintended)
>
> .. in that it's up to the Jena User to set up their deployment in a
> way that matches any security requirements.
> (The proposed feature, as a separate extension or part of Fuseki Graph
> ACL would have to be explicitly configured/enabled.)
>
> > >> Concern 1:
> > >>
> > >> This bypasses Fuseki-provided security and puts the control function
> > >> outside the Fuseki server in a separate server that is not part of Jena.
> > >> It will only be secure if deployed in a constrained network environment.
> > >>
> > >> This is not secure except when run in a certain way and, personally, I
> > >> don't want to have to deal with a CVE because of that. CVE handling is
> > >> time consuming.
> > >>
> > >> I don't see why it is using jena-access (the named graph security
> > >> feature) except for the filtering on TDB. It is creating a dynamic
> > >> dataset for the query.
> > >
> > > You're right - it's only as secure as the middleware/proxy/whatever in
> > > front of it which supplies the ACL. (This was never intended to be
> > > used/exposed to end-users directly.)
> >
> > >> Concern 2: How does update fit into the picture? (GSP is not supported).
> > >
> > > I thought that, since GSP operations target a single graph, there is
> > > no need to extend support to it since it's already possible to
> > > restrict visibility (with the graph query parameter). Am I missing
> > > something?
> >
> > Having different ways to protect data across different operations is
> > confusing, and it is quite easy to have unexpected problems, which for
> > security is bad.
> >
> > For example, accessing the default graph when it is the union of the
> > named graphs.
>
> Good point - I'd forgotten about the union. In that case I suppose
> that completely invalidates the proposal, since GSP GET/HEAD requests
> of course don't have a body. (As explained in the PR-added readme,
> putting the allowed graphs in a header only works with a relatively
> small number of graphs, or if their IRIs are short.)
> .. unless GSP GET in union-mode was disallowed, when this feature is
> enabled.
>
> >
> > >>
> > >> Concern 3: It looks like a specific solution for a specific scenario.
> > >> Will it get uptake by the wide Jena user community?
> > >
> > > It's definitely specific. My thinking was that, if a subset of this
> > > were deemed useful, then it'd be better to exist as part of the core
> > > offering as opposed to us just bolting it on ourselves (at my job).
> > > But, if that's not the case - fair enough.
> >
> > What subsets do you have in mind?
>
> (In isolation of Fuseki Graph ACL) Allow Jena Users to supply (from an
> external-to-Fuseki/Jena system) a set of graphs to restrict SPARQL
> queries to (without having to rewrite the query) with similar
> performance to Fuseki Graph ACL (i.e. faster than the alternatives
> listed in the PR-attached readme).
> Hmm, having just written that, I suppose that's not really a smaller
> subset.
>
> >
> >      Andy
>
> --
> Vilnis Termanis
> Technical Specialist
>
> e | vilnis.termanis@iotics.com
> www.iotics.com
>
> The information contained in this email is strictly confidential and
> intended only for the parties noted. If this email was not intended
> for your use, please contact Iotics. For more on our Privacy Policy
> please visit https://www.iotics.com/legal/
>

Re: About JENA-2339 - security related

Posted by Vilnis Termanis <Vi...@iotics.com.INVALID>.

(Apologies for the delay - I've been busy at work with other stuff)

On Mon, 15 Aug 2022 at 14:27, Andy Seaborne <an...@apache.org> wrote:
>
> There is one Jena user - it's Vilnis (for Iotics).
>
> Your use cases - whatever they are - are for the current product and
> will evolve. Whether the way you propose will support the evolution of
> the use cases in the future, say the next 5 years, is unclear (and I
> think quite unlikely both on security features because product features
> evolve, and on wanting to work with spatial or text datasets).  Jena
> tries to give stability.
>
> The essence of the PR is ~30 lines in SecurityContextDynamic.
> The rest is rearranging the plumbing to have a magic user.
> This does not need DatasetGraphAccessControl.

You're right. I've come to the same conclusion. (I extended ACL to try
the approach because it was the quickest way to do so, at least so it
seemed when I started.)

>
> This could be in a custom query processor extending SPARQL_QueryDataset
> and overriding decideDataset, delivered as a Fuseki Module. (You can
> override the standard query processor (1 line of code) if you want all
> query services to have this, or override it for a particular service
> (2 lines of code), or add a new endpoint that offers only SPARQL query
> over a view of the dataset.) The latter is better for you because you
> can put API security on the endpoint. It's an opt-in, drop-in extension
> to a standard distribution Fuseki/Main.
>
> The amount of code reuse from SecurityContextView is maybe 20 lines, via
> SecurityContextView.filterTDB, and the functionality could be made into a
> function.
>
> Now your usage is not a security issue for the Fuseki server as the HTTP
> request interface is not changed. No interaction with GSP.
>
> So Iotics add their own query processor to a standard Fuseki server and
> can evolve the extension. Configuring the network for the extension is
> the responsibility of the Iotics deployment.
>

Noted. I'll have a go at that approach. In fact, I think I'll also try
(as an option) what both you & Martynas suggested: Rather than supplying
a completely external ACL list, allow specifying a query to
determine the set of visible graphs (e.g. using WACL/Solid).

> The extension might even be interesting to other Jena users not as a
> security feature but for the dynamic view capability.
>
>  >>> using a "SELECT {} 1" query, and
>  >>> adding a certain set of graphs makes the queries on my laptop take:
>  >>> ~600 graphs ~115ms
>  >>> ~1500 graphs ~162ms
>  >>> ~3k graphs ~240ms
>  >>> ~6k graphs ~400ms
>  >>
>  >> That's an illustration of the current system but we don't know what
>  >> is the cause of the cost.
>  >>
>  >> What piece of the code is taking the time?
>  >> Maybe the right thing to do is make it faster.
>  >
>  > I haven't looked into this in great detail, but from my understanding
>  > the time taken is a combination of a) parsing the input of allowed
>  > graphs and b) generating a new SecurityContext (holding a hashmap of
>  > said graphs). If providing a set of allowed graphs in the proposed way
>  > is not a no-go, I'm happy to dig into where the cost is exactly.
>
> We haven't seen the queries you were making. It is difficult to believe
> that Java takes >100ms to build a 6K entry hash map.
>

Yes, that definitely doesn't sound quite right for a "SELECT {} 1"
query with a set of ~6k graph URIs each of ~48 ASCII chars. (It could
well be that the regex for parsing it is the issue and that wouldn't
be required anyway with a URL or form param. I'll look into it.)

> You mentioned that the request line gets too long. True for GET but a SPARQL
> query request could be sent as an HTML form
> (application/x-www-form-urlencoded) so the list of graphs given with
> ?default-graph-uri=/?named-graph-uri= can be much larger than the
> practical GET limits.

That's a good point - I had forgotten about that! (We've been using
direct-POST all this time.)
I suppose it would allow for both GET with smallish lists and
form-encoded POST for larger ones.

Are there any (dis)advantages for POST by-form versus POST directly?
(I guess in the end, whether some arguments are in the URL path or
encoded in the body, doesn't really matter. Apart from that the SPARQL
query has to be url-decoded, unlike with direct-POST, I suppose.)

Anyway, thank you for all of your (and Martynas') input - much appreciated.

>
>      Andy
>
>


-- 
Vilnis Termanis
Technical Specialist

e | vilnis.termanis@iotics.com
www.iotics.com

Re: About JENA-2339 - security related

Posted by Vilnis Termanis <Vi...@iotics.com.INVALID>.

>  It does address individual graphs.

From my understanding (correct me if I'm wrong):
The Evaluator interface exposes the graph which could be checked
against but jena-permissions is currently limited to a single
graph/model, not a whole dataset. (I appreciate it could be extended
to support the latter.) This is why I looked at how fuseki-access does
it and went down that route. Also, from my understanding, if one is
only interested in controlling graph (rather than more granular s/p/o)
access, the TDB filter hooks can provide better performance (obviously
with a downside that they can only work with TDB* datasets, unlike
when using Jena Permissions).
The fact that jena-permissions can also be applied to updates is
indeed very interesting (but we at $company haven't gotten as far as
considering arbitrary end-user updates yet).

On Wed, 24 Aug 2022 at 12:15, Claude Warren <cl...@xenei.com> wrote:
>
> I am sorry that I am coming to this _VERY_ late.
>
> I don't understand why Permissions can not be used.  It does address
> individual graphs.  It does handle union graphs properly.  It does handle
> the difference between graph update and model update.  It calls back to the
> Security engine to effectively say "Can this user perform this operation on
> this IRI?"  It also asks "Can this user perform this operation on
> this/these triples in this IRI?"
>
> I don't understand what queries you are trying to execute that can not be
> answered.
>
> Claude
>
>
>
> On Mon, Aug 15, 2022 at 2:27 PM Andy Seaborne <an...@apache.org> wrote:
>
> > There is one Jena user - it's Vilnis (for Iotics).
> >
> > Your use cases - whatever they are - are for the current product and
> > will evolve. Whether the way you propose will support the evolution of
> > the use cases in the future, say the next 5 years, is unclear (and I
> > think quite unlikely both on security features because product features
> > evolve, and on wanting to work with spatial or text datasets).  Jena
> > tries to give stability.
> >
> > The essence of the PR is ~30 lines in SecurityContextDynamic.
> > The rest is rearranging the plumbing to have a magic user.
> > This does not need DatasetGraphAccessControl.
> >
> > This could be in a custom query processor extending SPARQL_QueryDataset
> > and overriding decideDataset, delivered as a Fuseki Module. (You can
> > override the standard query processor (1 line of code) if you want all
> > query services to have this, or override it for a particular service
> > (2 lines of code), or add a new endpoint that offers only SPARQL query
> > over a view of the dataset.) The latter is better for you because you
> > can put API security on the endpoint. It's an opt-in, drop-in extension
> > to a standard distribution Fuseki/Main.
> >
> > The amount of code reuse from SecurityContextView is maybe 20 lines, via
> > SecurityContextView.filterTDB, and the functionality could be made into a
> > function.
> >
> > Now your usage is not a security issue for the Fuseki server as the HTTP
> > request interface is not changed. No interaction with GSP.
> >
> > So Iotics add their own query processor to a standard Fuseki server and
> > can evolve the extension. Configuring the network for the extension is
> > the responsibility of the Iotics deployment.
> >
> > The extension might even be interesting to other Jena users not as a
> > security feature but for the dynamic view capability.
> >
> >  >>> using a "SELECT {} 1" query, and
> >  >>> adding a certain set of graphs makes the queries on my laptop take:
> >  >>> ~600 graphs ~115ms
> >  >>> ~1500 graphs ~162ms
> >  >>> ~3k graphs ~240ms
> >  >>> ~6k graphs ~400ms
> >  >>
> >  >> That's an illustration of the current system but we don't know what
> >  >> is the cause of the cost.
> >  >>
> >  >> What piece of the code is taking the time?
> >  >> Maybe the right thing to do is make it faster.
> >  >
> >  > I haven't looked into this in great detail, but from my understanding
> >  > the time taken is a combination of a) parsing the input of allowed
> >  > graphs and b) generating a new SecurityContext (holding a hashmap of
> >  > said graphs). If providing a set of allowed graphs in the proposed way
> >  > is not a no-go, I'm happy to dig into where the cost is exactly.
> >
> > We haven't seen the queries you were making. It is difficult to believe
> > that Java takes >100ms to build a 6K entry hash map.
> >
> > You mentioned that the request line gets too long. True for GET but a SPARQL
> > query request could be sent as an HTML form
> > (application/x-www-form-urlencoded) so the list of graphs given with
> > ?default-graph-uri=/?named-graph-uri= can be much larger than the
> > practical GET limits.
> >
> >      Andy
> >
> >
> >
>
> --
> I like: Like Like - The likeliest place on the web
> <http://like-like.xenei.com>
> LinkedIn: http://www.linkedin.com/in/claudewarren



-- 
Vilnis Termanis
Technical Specialist

e | vilnis.termanis@iotics.com
www.iotics.com


Re: About JENA-2339 - security related

Posted by Claude Warren <cl...@xenei.com>.

I am sorry that I am coming to this _VERY_ late.

I don't understand why Permissions can not be used.  It does address
individual graphs.  It does handle union graphs properly.  It does handle
the difference between graph update and model update.  It calls back to the
Security engine to effectively say "Can this user perform this operation on
this IRI?"  It also asks "Can this user perform this operation on
this/these triples in this IRI?"

I don't understand what queries you are trying to execute that can not be
answered.

Claude



On Mon, Aug 15, 2022 at 2:27 PM Andy Seaborne <an...@apache.org> wrote:

> There is one Jena user - it's Vilnis (for Iotics).
>
> Your use cases - whatever they are - are for the current product and
> will evolve. Whether the way you propose will support the evolution of
> the use cases in the future, say the next 5 years, is unclear (and I
> think quite unlikely both on security features because product features
> evolve, and on wanting to work with spatial or text datasets).  Jena
> tries to give stability.
>
> The essence of the PR is ~30 lines in SecurityContextDynamic.
> The rest is rearranging the plumbing to have a magic user.
> This does not need DatasetGraphAccessControl.
>
> This could be in a custom query processor extending SPARQL_QueryDataset
> and overriding decideDataset, delivered as a Fuseki Module. (You can
> override the standard query processor (1 line of code) if you want all
> query services to have this, or override it for a particular service
> (2 lines of code), or add a new endpoint that offers only SPARQL query
> over a view of the dataset.) The latter is better for you because you
> can put API security on the endpoint. It's an opt-in, drop-in extension
> to a standard distribution Fuseki/Main.
>
> The amount of code reuse from SecurityContextView is maybe 20 lines, via
> SecurityContextView.filterTDB, and the functionality could be made into a
> function.
>
> Now your usage is not a security issue for the Fuseki server as the HTTP
> request interface is not changed. No interaction with GSP.
>
> So Iotics add their own query processor to a standard Fuseki server and
> can evolve the extension. Configuring the network for the extension is
> the responsibility of the Iotics deployment.
>
> The extension might even be interesting to other Jena users not as a
> security feature but for the dynamic view capability.
>
>  >>> using a "SELECT {} 1" query, and
>  >>> adding a certain set of graphs makes the queries on my laptop take:
>  >>> ~600 graphs ~115ms
>  >>> ~1500 graphs ~162ms
>  >>> ~3k graphs ~240ms
>  >>> ~6k graphs ~400ms
>  >>
>  >> That's an illustration of the current system but we don't know what
>  >> is the cause of the cost.
>  >>
>  >> What piece of the code is taking the time?
>  >> Maybe the right thing to do is make it faster.
>  >
>  > I haven't looked into this in great detail, but from my understanding
>  > the time taken is a combination of a) parsing the input of allowed
>  > graphs and b) generating a new SecurityContext (holding a hashmap of
>  > said graphs). If providing a set of allowed graphs in the proposed way
>  > is not a no-go, I'm happy to dig into where the cost is exactly.
>
> We haven't seen the queries you were making. It is difficult to believe
> that Java takes >100ms to build a 6K entry hash map.
>
> You mentioned that the request line gets too long. True for GET but a SPARQL
> query request could be sent as an HTML form
> (application/x-www-form-urlencoded) so the list of graphs given with
> ?default-graph-uri=/?named-graph-uri= can be much larger than the
> practical GET limits.
>
>      Andy
>
>
>

-- 
I like: Like Like - The likeliest place on the web
<http://like-like.xenei.com>
LinkedIn: http://www.linkedin.com/in/claudewarren

Re: About JENA-2339 - security related

Posted by Andy Seaborne <an...@apache.org>.

There is one Jena user - it's Vilnis (for Iotics).

Your use cases - whatever they are - are for the current product and 
will evolve. Whether the way you propose will support the evolution of 
the use cases in the future, say the next 5 years, is unclear (and I 
think quite unlikely both on security features because product features
evolve, and on wanting to work with spatial or text datasets).  Jena
tries to give stability.

The essence of the PR is ~30 lines in SecurityContextDynamic.
The rest is rearranging the plumbing to have a magic user.
This does not need DatasetGraphAccessControl.

This could be in a custom query processor extending SPARQL_QueryDataset
and overriding decideDataset, delivered as a Fuseki Module. (You can
override the standard query processor (1 line of code) if you want all
query services to have this, or override it for a particular service
(2 lines of code), or add a new endpoint that offers only SPARQL query
over a view of the dataset.) The latter is better for you because you
can put API security on the endpoint. It's an opt-in, drop-in extension
to a standard distribution Fuseki/Main.

The amount of code reuse from SecurityContextView is maybe 20 lines, via
SecurityContextView.filterTDB, and the functionality could be made into a
function.

Now your usage is not a security issue for the Fuseki server as the HTTP 
request interface is not changed. No interaction with GSP.

So Iotics add their own query processor to a standard Fuseki server and 
can evolve the extension. Configuring the network for the extension is 
the responsibility of the Iotics deployment.

The extension might even be interesting to other Jena users not as a
security feature but for the dynamic view capability.

 >>> using a "SELECT {} 1" query, and
 >>> adding a certain set of graphs makes the queries on my laptop take:
 >>> ~600 graphs ~115ms
 >>> ~1500 graphs ~162ms
 >>> ~3k graphs ~240ms
 >>> ~6k graphs ~400ms
 >>
 >> That's an illustration of the current system but we don't know what 
 >> is the cause of the cost.
 >>
 >> What piece of the code is taking the time?
 >> Maybe the right thing to do is make it faster.
 >
 > I haven't looked into this in great detail, but from my understanding
 > the time taken is a combination of a) parsing the input of allowed
 > graphs and b) generating a new SecurityContext (holding a hashmap of
 > said graphs). If providing a set of allowed graphs in the proposed way
 > is not a no-go, I'm happy to dig into where the cost is exactly.

We haven't seen the queries you were making. It is difficult to believe 
that Java takes >100ms to build a 6K entry hash map.
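
One rough way to check that intuition is a small timing sketch like the
one below. It is only an illustration: the class name, the synthetic
IRIs and their 48-character length are assumptions, not code or data
from the PR. Timings depend on the machine and on JVM warm-up, so no
figure is quoted here.

```java
import java.util.HashSet;
import java.util.Set;

public class GraphSetTiming {

    // Build a set of n synthetic graph IRIs (hypothetical names).
    // Each IRI is 48 ASCII characters: a 25-character prefix plus 23 digits.
    public static Set<String> buildGraphSet(int n) {
        Set<String> graphs = new HashSet<>(n * 2);
        for (int i = 0; i < n; i++) {
            graphs.add(String.format("http://example.org/graph/%023d", i));
        }
        return graphs;
    }

    public static void main(String[] args) {
        // Warm-up run so the timed run is less dominated by class loading and JIT.
        buildGraphSet(6000);

        long start = System.nanoTime();
        Set<String> graphs = buildGraphSet(6000);
        long micros = (System.nanoTime() - start) / 1000;

        System.out.println(graphs.size() + " entries built in " + micros + " us");
    }
}
```

If building the set is cheap on the machine in question, the cost is
more likely to be elsewhere, for example in parsing the list of graphs
out of the request.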

You mentioned that the request line gets too long. True for GET but a SPARQL
query request could be sent as an HTML form
(application/x-www-form-urlencoded) so the list of graphs given with
?default-graph-uri=/?named-graph-uri= can be much larger than the
practical GET limits.
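
As a minimal sketch of the form-encoded approach, the following builds
the request body with the query, default-graph-uri and named-graph-uri
parameters of the SPARQL protocol. The class name, method name and
example IRIs are hypothetical; the resulting string would be sent as the
body of a POST with Content-Type application/x-www-form-urlencoded.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.StringJoiner;

public class SparqlFormBody {

    // Build an application/x-www-form-urlencoded body for a SPARQL query request.
    // The graph lists go in the body, so they are not subject to URL length limits.
    public static String build(String query,
                               List<String> defaultGraphs,
                               List<String> namedGraphs) {
        StringJoiner body = new StringJoiner("&");
        body.add("query=" + enc(query));
        for (String g : defaultGraphs) {
            body.add("default-graph-uri=" + enc(g));
        }
        for (String g : namedGraphs) {
            body.add("named-graph-uri=" + enc(g));
        }
        return body.toString();
    }

    private static String enc(String s) {
        return URLEncoder.encode(s, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String body = build("ASK {}",
                List.of("http://example.org/g1"),
                List.of("http://example.org/g2"));
        System.out.println(body);
    }
}
```

The same parameters work on a GET query string for short lists; the
form-encoded POST only moves them into the request body.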

     Andy

Re: About JENA-2339 - security related

Posted by Vilnis Termanis <Vi...@iotics.com.INVALID>.

On Mon, 1 Aug 2022 at 12:29, Andy Seaborne <an...@apache.org> wrote:
>
>
>
> On 28/07/2022 20:50, Vilnis Termanis wrote:
> > Hi Andy & Jena development community,
> >
> > (Answers inline - apologies if I repeat myself)
> >
> > FYI - Our aim is to enable end-users to make SPARQL queries whilst
> > respecting visibility restrictions.
> > I.e. users (indirectly) add sets of related triples to a dataset and
> > they can choose who has visibility (beyond themselves) over these,
> > either: Nobody, Everyone or a chosen set (which can be updated). Note
> > that this restriction is not by a specific subject or predicate.
> > (Although the sets of triples do have relationships - not all of them
> > are known in advance.)
>
> Let's clarify terminology here.
>
> A "Jena user" is a person or organisation that is downloading Jena,
> either as the formal release (source code) or convenience binaries (e.g.
> jars from Maven Central). The "convenience binaries" is the more usual case.
>
> Not Iotics users. Systems built with Jena have their own users.
> (The Apache License applies - including clause 7.)
>
> The responsibility is between the downstream system builder and their
> users of product or service being "fit for purpose".

Sorry about that - I should have been clearer with the terms.

In the submission - there is only one entity - the "Fuseki user" (e.g.
via BasicAuth) to which the dynamic mode applies. However, since this
is intended to be used as part of an integration (by Jena users - to
gate access to their own domain-specific end-users), the
authentication bit I think is irrelevant. (E.g. a separate service
endpoint could have the proposed functionality enabled and this is
what the integration calls.)

>
> > using a "SELECT {} 1" query, and
> > adding a certain set of graphs makes the queries on my laptop take:
> > ~600 graphs ~115ms
> > ~1500 graphs ~162ms
> > ~3k graphs ~240ms
> > ~6k graphs ~400ms
>
> That's an illustration of the current system but we don't know what is
> the cause of the cost.
>
> What piece of the code is taking the time?
> Maybe the right thing to do is make it faster.

I haven't looked into this in great detail, but from my understanding
the time taken is a combination of a) parsing the input of allowed
graphs and b) generating a new SecurityContext (holding a hashmap of
said graphs). If providing a set of allowed graphs in the proposed way
is not a no-go, I'm happy to dig into where the cost is exactly.
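
As a rough check on the figures quoted above, the timings look close to linear in the number of allowed graphs. The sketch below runs a least-squares fit over the four approximate data points. It is not a profile and cannot say whether parsing or building the SecurityContext dominates; it only suggests a cost of about 0.05 ms per listed graph on top of a fixed cost of roughly 80 ms.

```python
# Timings quoted in the thread: (number of allowed graphs, query time in ms).
# The values are approximate ("~") laptop measurements.
samples = [(600, 115), (1500, 162), (3000, 240), (6000, 400)]


def fit_line(points):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


intercept_ms, ms_per_graph = fit_line(samples)
print(f"fixed cost ~{intercept_ms:.0f} ms, ~{ms_per_graph * 1000:.0f} us per graph")
```

A per-graph cost this steady would be consistent with work done once per listed graph, which is where a profiler run could start.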

>
> And in the general area - what are you using for authentication?
>

For us right now, we're only using fuseki:auth "basic" for the
purposes of differentiating different access levels against Fuseki
Data Access Control configuration (by mapping those to Fuseki users),
e.g.:
Fuseki user1 => allowed to see graphs A & B
Fuseki user2 => allowed to see graphs B & C
Fuseki user3 => has the proposed dynamic-access feature
enabled (i.e. no access unless the pragma preamble exists in query
with 1+ graphs defined)

Said Fuseki users (=roles) are then chosen based on what the system
needs to do (domain-specific).
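
The mapping above can be sketched as follows. This is a minimal illustration, not the code in the PR: the preamble syntax (`# allowed-graphs: ...`) is invented for this sketch, and graph names stand in for graph IRIs.

```python
import re

# Static mapping, as in the example above (graph names are placeholders).
STATIC_ACL = {"user1": {"A", "B"}, "user2": {"B", "C"}}
DYNAMIC_USERS = {"user3"}

# HYPOTHETICAL preamble syntax, invented for this sketch; the PR defines its own.
PRAGMA = re.compile(r"^#\s*allowed-graphs:\s*(.*)$")


def visible_graphs(user, query):
    """Return the set of graph names a request from `user` may see."""
    if user in STATIC_ACL:
        return set(STATIC_ACL[user])
    if user in DYNAMIC_USERS:
        first_line = query.split("\n", 1)[0]
        match = PRAGMA.match(first_line)
        # No preamble (or an empty one) means no access at all.
        return set(match.group(1).split()) if match else set()
    return set()  # unknown users see nothing
```

For example, `visible_graphs("user3", "# allowed-graphs: A C\nSELECT * {}")` yields graphs A and C, while the same query without the first line yields the empty set.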

> There is some bearer auth support in the next release ... it does not
> provide complete bearer auth because it can't cover all cases (e.g. JWT
> validation). It is more of a framework template with which to build a
> local solution.

I'm showing my lack of JWT/Bearer auth knowledge - but is this a
building block for what Martynas suggested, namely that the token implies
the user to which the dynamic ACL applies and then access can be
restricted e.g. via WAC/Solid? (Correct me if I'm wrong, but is this
still not a solution that involves ACL rules being stored in Jena, or
at least being accessible via SPARQL for a SERVICE call?)

>
> ----
>
> "FMod_ABAC" is not related to jena-permissions.
>
> "FMod_" means Fuseki Module.
> https://jena.apache.org/documentation/fuseki2/fuseki-modules
>     No forks.
> ABAC = Attribute Based Access Control.
>
> Using attributes separates ACLs from directly naming users for access to
> things. In FMod_ABAC, the things are triples. Triples have "labels". Labels
> are attribute expressions, including AND and OR operators.
>
>      "employee | contractor" -- must have the "employee" attribute
>                                 or the "contractor" attribute.
>
>      "employee & dept=engineering" -- must have both "employee" and
>                                      "dept=engineering" attributes.
>
> There is a division of responsibilities. The data is labelled - so the
> data owner is responsible for the data attribute requirements. The
> assignment of attributes to users is separate.
>
> > FYI - In our case this means that we have a "make SPARQL query" API
> > call. When received, the applicable user (our domain) is known and, in
> > the proposed PR, we can prepend the set of allowed graphs to the query
> > (which have been looked up prior to query execution, externally). The
> > end user has NO direct access to Fuseki itself.
>
> You have a solution presuming a protected network, or possibly a
> container with in-container networking.
>
> That's my Concern 1. Security conditions outside Jena must be met.
> Having that, even if not in use, is an issue.
>

Maybe I misunderstand, but is this not in the same boat as:
a) Configuring a service which allows write access (but not gating who
can reach said service)
b) Configuring Fuseki access control in config and allowing 1+ graphs
(which shouldn't be included)
c) Configuring a service which allows read access to all graphs (i.e.
without Fuseki Graph ACL - again unintended)

.. in that it's up to the Jena User to set up their deployment in a
way that matches any security requirements.
(The proposed feature, as a separate extension or part of Fuseki Graph
ACL would have to be explicitly configured/enabled.)

> >> Concern 1:
> >>
> >> This by passes Fuseki-provided security and puts the control function
> >> outside the Fuseki server in a separate server that is not part of Jena.
> >> It will only be secure if deployed in a constrained network environment.
> >>
> >> This is not secure except when run in a certain way and, personally, I
> >> don't want to have to deal with a CVE because of that. CVE handling is
> >> time consuming.
> >>
> >> I don't see why it is using jena-access (the named graph security
> >> feature) except for the filtering on TDB. It is creating a dynamic
> >> dataset for the query.
> >
> > You're right - it's only as secure as the middleware/proxy/whatever in
> > front of it which supplies the ACL. (This was never intended to be
> > used/exposed to end-users directly.)
>
> >> Concern 2: How does update fit into the picture? (GSP is not supported).
> >
> > I thought that, since GSP operations target a single graph, there is
> > no need to extend support to it since it's already possible to
> > restrict visibility (with the graph query parameter). Am I missing
> > something?
>
> Having different ways to protect data across different operations is
> confusing, and it makes unexpected problems easy to introduce, which is
> bad for security.
>
> For example: accessing the default graph when it is the union of the
> named graphs.

Good point - I'd forgotten about the union. In that case I suppose
that completely invalidates the proposal, since GSP GET/HEAD requests of
course don't have a body. (As explained in the PR-added readme,
putting the allowed graphs in a header only works with a relatively
small number of graphs, or if their IRIs are short.)
.. unless GSP GET in union-mode was disallowed, when this feature is enabled.
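
To put rough numbers on the header limitation mentioned above, here is a small sketch. It assumes a total request-header limit of 8 KiB (a common server default, but configurable and not taken from this thread) and hypothetical 31-character graph IRIs. Under those assumptions only on the order of a few hundred graph IRIs fit, well short of the ~6k graphs discussed.

```python
# ASSUMPTION: 8 KiB is a common default limit for the total size of HTTP
# request headers; the real limit depends on the server and any proxies.
HEADER_LIMIT_BYTES = 8 * 1024


def example_iris(n):
    """Hypothetical graph IRIs, 31 characters each."""
    return [f"http://example.org/graph/{i:06d}" for i in range(n)]


def header_value_bytes(graph_iris):
    """Size of a comma-separated list of graph IRIs sent as one header value."""
    return len(",".join(graph_iris).encode("utf-8"))


def fits_in_header(graph_iris, limit=HEADER_LIMIT_BYTES):
    """True if the list alone stays within the assumed header limit."""
    return header_value_bytes(graph_iris) <= limit


for n in (100, 600, 6000):
    iris = example_iris(n)
    print(n, header_value_bytes(iris), fits_in_header(iris))
```

Longer IRIs shrink the number that fits proportionally, which matches the remark that the header approach only works for few graphs or short IRIs.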

>
> >>
> >> Concern 3: It looks like a specific solution for a specific scenario.
> >> Will it get uptake by the wide Jena user community?
> >
> > It's definitely specific. My thinking was that, if a subset of this
> > were deemed useful, then it'd be better to exist as part of the core
> > offering as opposed to us just bolting it on ourselves (at my job).
> > But, if that's not the case - fair enough.
>
> What subsets do you have in mind?

(In isolation of Fuseki Graph ACL) Allow Jena Users to supply (from an
external-to-Fuseki/Jena system) a set of graphs to restrict SPARQL
queries to (without having to rewrite the query) with similar
performance to Fuseki Graph ACL (i.e. faster than the alternatives
listed in the PR-attached readme).
Hmm, having just written that, I suppose that's not really a smaller subset.

>
>      Andy

-- 
Vilnis Termanis
Technical Specialist

e | vilnis.termanis@iotics.com
www.iotics.com

The information contained in this email is strictly confidential and
intended only for the parties noted. If this email was not intended
for your use, please contact Iotics. For more on our Privacy Policy
please visit https://www.iotics.com/legal/

Re: About JENA-2339 - security related

Posted by Andy Seaborne <an...@apache.org>.

On 28/07/2022 20:50, Vilnis Termanis wrote:
> Hi Andy & Jena development community,
> 
> (Answers inline - apologies if I repeat myself)
> 
> FYI - Our aim is to enable end-users to make SPARQL queries whilst
> respecting visibility restrictions.
> I.e. users (indirectly) add sets of related triples to a dataset and
> they can choose who has visibility (beyond themselves) over these,
> either: Nobody, Everyone or a chosen set (which can be updated). Note
> that this restriction is not by a specific subject or predicate.
> (Although the sets of triples do have relationships - not all of them
> are known in advance.)

Let's clarify terminology here.

A "Jena user" is a person or organisation that is downloading Jena, 
either as the formal release (source code) or convenience binaries (e.g. 
jars from Maven Central). The "convenience binaries" is the more usual case.

Not Iotics users. Systems built with Jena have their own users.
(The Apache License applies - including clause 7.)

The responsibility is between the downstream system builder and their 
users of product or service being "fit for purpose".

> using a "SELECT {} 1" query, and
> adding a certain set of graphs makes the queries on my laptop take:
> ~600 graphs ~115ms
> ~1500 graphs ~162ms
> ~3k graphs ~240ms
> ~6k graphs ~400ms

That's an illustration of the current system but we don't know what is 
the cause of the cost.

What piece of the code is taking the time?
Maybe the right thing to do is make it faster.

And in the general area - what are you using for authentication?

There is some bearer auth support in the next release ... it does not 
provide complete bearer auth because it can't cover all cases (e.g. JWT 
validation). It is more of a framework template with which to build a 
local solution.

----

"FMod_ABAC" is not related to jena-permissions.

"FMod_" means Fuseki Module.
https://jena.apache.org/documentation/fuseki2/fuseki-modules
    No forks.
ABAC = Attribute Based Access Control.

Using attributes separates ACLs from directly naming users for access to
things. In FMod_ABAC, the things are triples. Triples have "labels". Labels
are attribute expressions, including AND and OR operators.

     "employee | contractor" -- must have the "employee" attribute
                                or the "contractor" attribute.

     "employee & dept=engineering" -- must have both "employee" and
                                     "dept=engineering" attributes.

There is a division of responsibilities. The data is labelled - so the 
data owner is responsible for the data attribute requirements. The 
assignment of attributes to users is separate.
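
The two label examples above can be evaluated with a few lines of code. This is a sketch of the idea only, not FMod_ABAC's implementation. It assumes no parentheses, that "&" binds tighter than "|", and that an attribute such as "dept=engineering" is matched as an exact string.

```python
def satisfies(label, user_attributes):
    """Evaluate a label such as "employee & dept=engineering".

    ASSUMPTIONS for this sketch: no parentheses, "&" binds tighter than "|",
    and attributes are compared as exact strings.
    """
    return any(
        all(term.strip() in user_attributes for term in alternative.split("&"))
        for alternative in label.split("|")
    )


print(satisfies("employee | contractor", {"contractor"}))
print(satisfies("employee & dept=engineering", {"employee"}))
```

The data side only carries the label, and the user side only carries the attribute set, which mirrors the division of responsibilities described above.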

> FYI - In our case this means that we have a "make SPARQL query" API
> call. When received, the applicable user (our domain) is known and, in
> the proposed PR, we can prepend the set of allowed graphs to the query
> (which have been looked up prior to query execution, externally). The
> end user has NO direct access to Fuseki itself.

You have a solution presuming a protected network, or possibly a 
container with in-container networking.

That's my Concern 1. Security conditions outside Jena must be met. 
Having that, even if not in use, is an issue.

>> Concern 1:
>>
>> This by passes Fuseki-provided security and puts the control function
>> outside the Fuseki server in a separate server that is not part of Jena.
>> It will only be secure if deployed in a constrained network environment.
>>
>> This is not secure except when run in a certain way and, personally, I
>> don't want to have to deal with a CVE because of that. CVE handling is
>> time consuming.
>>
>> I don't see why it is using jena-access (the named graph security
>> feature) except for the filtering on TDB. It is creating a dynamic
>> dataset for the query.
> 
> You're right - it's only as secure as the middleware/proxy/whatever in
> front of it which supplies the ACL. (This was never intended to be
> used/exposed to end-users directly.)

>> Concern 2: How does update fit into the picture? (GSP is not supported).
> 
> I thought that, since GSP operations target a single graph, there is
> no need to extend support to it since it's already possible to
> restrict visibility (with the graph query parameter). Am I missing
> something?

Having different ways to protect data across different operations is
confusing, and it makes unexpected problems easy to introduce, which is
bad for security.

For example: accessing the default graph when it is the union of the
named graphs.

>>
>> Concern 3: It looks like a specific solution for a specific scenario.
>> Will it get uptake by the wide Jena user community?
> 
> It's definitely specific. My thinking was that, if a subset of this
> were deemed useful, then it'd be better to exist as part of the core
> offering as opposed to us just bolting it on ourselves (at my job).
> But, if that's not the case - fair enough.

What subsets do you have in mind?

     Andy

Re: About JENA-2339 - security related

Posted by Martynas Jusevičius <ma...@atomgraph.com>.

On Mon, 8 Aug 2022 at 17.21, Vilnis Termanis
<Vi...@iotics.com.invalid> wrote:

> On Sat, 30 Jul 2022 at 21:14, Martynas Jusevičius
> <ma...@atomgraph.com> wrote:
> >
> > On Fri, Jul 29, 2022 at 7:27 PM Vilnis Termanis
> > <Vi...@iotics.com.invalid> wrote:
> > >
> > > (inline)
> > >
> > > On Fri, 29 Jul 2022 at 07:56, Martynas Jusevičius
> > > <ma...@atomgraph.com> wrote:
> > > >
> > > > “Sets of triples” — aren’t these datasets?
> > > >
> > > > > Couldn’t this use case be addressed by maintaining per-user
> > > > > datasets? Not sure if Fuseki can create datasets on the fly, but
> > > > > this seems like a much simpler feature to implement compared to a
> > > > > whole new ACL mechanism.
> > >
> > > The idea is, that if you had these "sets of triples" A-Z, one user
> > > might be allowed to see A-M and another C-Q. With per-user datasets
> > > you'd have to duplicate data to achieve that. And, when the ACL
> > > changes, you'd have to copy/move triples from one dataset to another.
> > > (Or am I missing a nuance to your proposal? Do you mean dynamically
> > > creating a new dataset which references graphs from another dataset?)
> >
> > No, not missing :)
> >
> > I mean it sounds like a useful feature, and we could probably find use
> > for it ourselves.
> >
> > But if the ACL is graph-scoped, can't it employ an existing ontology
> > such as WAC? [1]
> > It would be eating your own dogfood, and of course it being RDF you
> > could query and update your ACL using SPARQL. That would probably
> > require a meta-dataset containing ACL data for each secured dataset.
>
> It definitely could (and in fact, we are doing pretty much what you
> describe right now).
> However, thinking in general terms - aren't there two levels to such
> an ACL solution:
>
> 1) ACL is treated as completely external to Jena/Fuseki: Something
> else is responsible for providing the "allow list" of graphs. (And:
> Ideally there is no hard requirement to require a Java integration to
> use the feature.)


This option has the advantage of not being specific to Fuseki, meaning that
it can work with any triplestore. The access check is encapsulated as a
SPARQL query and can be easily reused across frameworks.
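
As an illustration of an access check expressed as a SPARQL query, the sketch below builds an ASK query using the WAC vocabulary. It is not the LinkedDataHub query linked in this thread. It assumes the authorizations sit in the default graph of whatever dataset the query is sent to, and it only covers the simplest case (a named agent with read access to one resource).

```python
ACL_NS = "http://www.w3.org/ns/auth/acl#"


def wac_read_check(agent_iri, graph_iri):
    """Build an ASK query: may this agent read this graph?"""
    for iri in (agent_iri, graph_iri):
        # Refuse characters that are not allowed inside a SPARQL IRI reference.
        if any(c in iri for c in '<>"{}|^`\\') or any(ord(c) <= 0x20 for c in iri):
            raise ValueError(f"not a safe IRI: {iri!r}")
    return (
        f"PREFIX acl: <{ACL_NS}>\n"
        "ASK {\n"
        "  ?auth a acl:Authorization ;\n"
        f"        acl:agent <{agent_iri}> ;\n"
        f"        acl:accessTo <{graph_iri}> ;\n"
        "        acl:mode acl:Read .\n"
        "}"
    )


print(wac_read_check("http://example.org/alice", "http://example.org/graph/A"))
```

Because the result is plain SPARQL text, the same check could be sent to any SPARQL endpoint holding the ACL data, which is the portability point made above.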


> 2) ACL is enabled by storing rules in a specific graph in a Jena
> dataset (and there I agree WAC seems very sensible - as you've
> linked).
>
> I'm querying about (1) where Jena/Fuseki is not necessarily the centre
> of the picture, but part of multiple components.
>
> >
> > As it happens we have an authorization request filter for Jersey that
> > checks WAC access using SPARQL:
> >
> https://github.com/AtomGraph/LinkedDataHub/blob/master/src/main/java/com/atomgraph/linkeddatahub/server/filter/request/AuthorizationFilter.java
> > The SPARQL query:
> >
> https://github.com/AtomGraph/LinkedDataHub/blob/master/src/main/webapp/WEB-INF/web.xml#L25
> >
> > [1] https://www.w3.org/wiki/WebAccessControl
> >
> > >
> > > >
> > > > On Thu, 28 Jul 2022 at 22.51, Vilnis Termanis
> > > > <Vi...@iotics.com.invalid> wrote:
> > > >
> > > > > Hi Andy & Jena development community,
> > > > >
> > > > > (Answers inline - apologies if I repeat myself)
> > > > >
> > > > > FYI - Our aim is to enable end-users to make SPARQL queries whilst
> > > > > respecting visibility restrictions.
> > > > > I.e. users (indirectly) add sets of related triples to a dataset
> and
> > > > > they can choose who has visibility (beyond themselves) over these,
> > > > > either: Nobody, Everyone or a chosen set (which can be updated).
> Note
> > > > > that this restriction is not by a specific subject or predicate.
> > > > > (Although the sets of triples do have relationships - not all of
> them
> > > > > are known in advance.)
> > > > >
> > > > > On Thu, 28 Jul 2022 at 10:43, Andy Seaborne <an...@apache.org>
> wrote:
> > > > > >
> > > > > > JENA-2339
> > > > > > PR#1441
> > > > > >
> > > > >
> https://github.com/vtermanis/jena/blob/dynamic-graph-restriction-extension/MOVE_ME_DynamicACL_notes.md
> > > > > >
> > > > > > tl;dr:
> > > > > >
> > > > > > It is a different role for Fuseki.
> > > > > >
> > > > > > Fuseki execute the security but the setup and control is from a
> trusted
> > > > > > external server on the request execution path.
> > > > > >
> > > > > > It assumes certain deployment environments to be safe.
> > > > >
> > > > > FYI - In our case this means that we have a "make SPARQL query" API
> > > > > call. When received, the applicable user (our domain) is known
> and, in
> > > > > the proposed PR, we can prepend the set of allowed graphs to the
> query
> > > > > (which have been looked up prior to query execution, externally).
> The
> > > > > end user has NO direct access to Fuseki itself.
> > > > >
> > > > > >
> > > > > > My feeling is that we should make Fuseki configurable enough so
> that a
> > > > > > downstream 3rd party can add their security solution that is
> suitable
> > > > > > for their environment. But we should not incorporate a particular
> > > > > > security solution that relies on the deployment environment.
> > > > > >
> > > > > > ----
> > > > > >
> > > > > > I've asked for more information about the claim on a performance
> > > > > > motivator and some other background information.
> > > > > >
> > > > > > The usage patterns are not yet clear. The data is described as
> "a one
> > > > > > graph per handful of subjects and their properties" and "100s of
> > > > > > graphs". What the queries are is unstated.
> > > > >
> > > > > Right now, each graph has in the range of 300-500 triples (though
> the
> > > > > amount depends on how much additional/domain-specific metadata
> > > > > end-users choose to add) and the scale of deployed Fuseki datasets
> > > > > range from having a few to ~6k graphs.
> > > > > Since we'd like to allow end-users to run **any** queries they wish
> > > > > (we enforce query timeouts), it's difficult to give concrete
> examples.
> > > > > I can however say that TDB unionDefaultGraph mode is enabled (i.e.
> > > > > most end-users won't choose to explicitly target a specific graph)
> and
> > > > > that one of our representative "search" queries (which combines
> > > > > GeoSPARQL + multiple explicit property matching across multiple
> > > > > different subjects in a UNION + subsequent collection of mandatory
> &
> > > > > optional fields) is between 20-40% faster than the current custom
> > > > > solution.
> > > > > > (Note that we have also tried query re-writing to insert FROM/FROM
> > > > > > NAMED clauses - and that is very slow in comparison, presumably due
> > > > > > to the higher level filtering involved, unlike the quad filter
> > > > > > herein.)
> > > > >
> > > > > >
> > > > > > There is no characterisation of the queries being made. If we are
> > > > > > talking about overheads, the cases of a few big queries and many
> small
> > > > > > queries are different.
> > > > >
> > > > > (pasted from JENA-2339 ticket) - using a "SELECT {} 1" query, and
> > > > > adding a certain set of graphs makes the queries on my laptop take:
> > > > > ~600 graphs ~115ms
> > > > > ~1500 graphs ~162ms
> > > > > ~3k graphs ~240ms
> > > > > ~6k graphs ~400ms
> > > > >
> > > > > >
> > > > > > > The scale looks small (less than a million triples -
> > > > > > > approximating as 100 graphs * 1000 triples). That makes the
> > > > > > > point about access to TDB hooks a bit redundant.
> > > > >
> > > > > > The dataset I've tested this with has ~1.8M triples. That's not to
> > > > > > say this is the scale we're hoping to satisfy - that's just what I
> > > > > > tested with first. By redundant, do you mean an alternative
> > > > > > approach should be used for this scale?
> > > > >
> > > > > >
> > > > > >
> > > > > > > There are distinguished users. A request from one of these
> > > > > > > users causes the set of visible graphs to be read from a comment
> > > > > > > at the start of the query text in the request.
> > > > > >
> > > > > > The use of large numbers of small named graphs to manage security
> > > > > > settings looks to me like triple-level security.  I have already
> > > > > > mentioned work "FMod_ABAC": (£job related) awhile back
> (2/Jan/2022). It
> > > > > > is triple level attribute-based security.
> > > > >
> > > > > > It could well be that I'm seeing the wrong solution for the feature
> > > > > > we're trying to support (that's the other reason for reaching out
> > > > > > to the community). The reason (rightly or wrongly) to model this as
> > > > > > a set of graphs is: Each set of triples to be restricted are
> > > > > > related, but span multiple subjects and could also relate to other
> > > > > > subjects in other sets (as well as externally).
> > > > > > Hence I couldn't see how e.g. Jena Permissions could be applied
> > > > > > here: When you're provided with a single triple to check - you
> > > > > > would have to understand what type of subject it is and how it
> > > > > > relates to the "top level" subject to which the ACL applies.
> > > > > > Bundling everything into a graph seemed like a viable option.
> > > > >
> > > > > >
> > > > > > Concern 1:
> > > > > >
> > > > > > This by passes Fuseki-provided security and puts the control
> function
> > > > > > outside the Fuseki server in a separate server that is not part
> of Jena.
> > > > > > It will only be secure if deployed in a constrained network
> environment.
> > > > > >
> > > > > > This is not secure except when run in a certain way and,
> personally, I
> > > > > > don't want to have to deal with a CVE because of that. CVE
> handling is
> > > > > > time consuming.
> > > > > >
> > > > > > I don't see why it is using jena-access (the named graph security
> > > > > > feature) except for the filtering on TDB. It is creating a
> dynamic
> > > > > > dataset for the query.
> > > > >
> > > > > You're right - it's only as secure as the
> middleware/proxy/whatever in
> > > > > front of it which supplies the ACL. (This was never intended to be
> > > > > used/exposed to end-users directly.)
> > > > > The purpose of extending jena-access (instead of immediately
> writing
> > > > > it as a separate module) was to illustrate with minimal code
> changes
> > > > > (+ extension of existing tests) what it could look like, for
> > > > > discussion. (The quad filtering / performance aspect would be the
> > > > > same, regardless of location, I presume.)
> > > > >
> > > > > >
> > > > > > Concern 2: How does update fit into the picture? (GSP is not
> supported).
> > > > >
> > > > > I thought that, since GSP operations target a single graph, there
> is
> > > > > no need to extend support to it since it's already possible to
> > > > > restrict visibility (with the graph query parameter). Am I missing
> > > > > something?
> > > > >
> > > > > >
> > > > > > Concern 3: It looks like a specific solution for a specific
> scenario.
> > > > > > Will it get uptake by the wide Jena user community?
> > > > >
> > > > > It's definitely specific. My thinking was that, if a subset of this
> > > > > were deemed useful, then it'd be better to exist as part of the
> core
> > > > > offering as opposed to us just bolting it on ourselves (at my job).
> > > > > But, if that's not the case - fair enough.
> > > > >
> > > > > >
> > > > > > Concern 4: Is there long-term support and maintenance for the
> feature?
> > > > > > (e.g. 5y+)
> > > > > > How do we respond to users@ message about it? Is it
> experimental code or
> > > > > > has it been used for real? Is the feature set stable?
> > > > >
> > > > > My understanding is that jena-access is classed as stable (we're
> using
> > > > > it for something else already in production) and thus, since this
> > > > > merely produces a SecurityContext with a larger set of graphs,
> would
> > > > > theoretically be no less stable.
> > > > >
> > > > > >
> > > > > >
> > > > > > Opinion: it is not unreasonable to provide support for this kind
> of
> > > > > > customization of Fuseki.
> > > > > >
> > > > > > An extension can then provide whatever security is needed for the
> > > > > > situation and it is the Fuseki user/operator making the
> decisions about
> > > > > > what is acceptable security and what isn't.
> > > > > >
> > > > > > Fuseki has ways to add custom processors and this seems the way
> to
> > > > > > provide an alternative way to make queries.
> > > > > >
> > > > > > Putting it in the distribution codebase is a big step for the
> project.
> > > > > > At the very least, it needs to be mature and likely to be used.
> > > > >
> > > > > We wouldn't be reaching out if we weren't likely to want to use
> such a
> > > > > feature. All these concerns/questions/suggestions are exactly what
> we
> > > > > were hoping for. If I can provide any more context/tests/samples,
> let
> > > > > me know.
> > > > > (I completely get the concerns about diluting a known security
> feature
> > > > > and have no issue with something like this being a separate
> > > > > component.)
> > > > >
> > > > > >
> > > > > > Background: Currently jena-access is in Fuseki main. It is not
> optional
> > > > > > because it predates Fuseki modules.
> > > > > >
> > > > > >      Andy
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Vilnis Termanis
> > > > > Technical Specialist
> > > > >
> > > > > e | vilnis.termanis@iotics.com
> > > > > www.iotics.com
> > > > >
> > >
> > >
> > >
> > > --
> > > Vilnis Termanis
> > > Technical Specialist
> > >
> > > e | vilnis.termanis@iotics.com
> > > www.iotics.com
> > >
> > > The information contained in this email is strictly confidential and
> > > intended only for the parties noted. If this email was not intended
> > > for your use, please contact Iotics. For more on our Privacy Policy
> > > please visit https://www.iotics.com/legal/
>
>
>
> --
> Vilnis Termanis
> Technical Specialist
>
> e | vilnis.termanis@iotics.com
> www.iotics.com
>
> The information contained in this email is strictly confidential and
> intended only for the parties noted. If this email was not intended
> for your use, please contact Iotics. For more on our Privacy Policy
> please visit https://www.iotics.com/legal/
>

Re: About JENA-2339 - security related

Posted by Vilnis Termanis <Vi...@iotics.com.INVALID>.

On Sat, 30 Jul 2022 at 21:14, Martynas Jusevičius
<ma...@atomgraph.com> wrote:
>
> On Fri, Jul 29, 2022 at 7:27 PM Vilnis Termanis
> <Vi...@iotics.com.invalid> wrote:
> >
> > (inline)
> >
> > On Fri, 29 Jul 2022 at 07:56, Martynas Jusevičius
> > <ma...@atomgraph.com> wrote:
> > >
> > > “Sets of triples” — aren’t these datasets?
> > >
> > > Couldn’t this use case be addressed by maintaining per-user datasets? Not
> > > sure if Fuseki can create datasets on the fly, but this seems like a much
> > > simpler feature to implement compared to a whole new ACL mechanism.
> >
> > The idea is, that if you had these "sets of triples" A-Z, one user
> > might be allowed to see A-M and another C-Q. With per-user datasets
> > you'd have to duplicate data to achieve that. And, when the ACL
> > changes, you'd have to copy/move triples from one dataset to another.
> > (Or am I missing a nuance to your proposal? Do you mean dynamically
> > creating a new dataset which references graphs from another dataset?)
>
> No, not missing :)
>
> I mean it sounds like a useful feature, and we could probably find use
> for it ourselves.
>
> But if the ACL is graph-scoped, can't it employ an existing ontology
> such as WAC? [1]
> It would be eating your own dogfood, and of course it being RDF you
> could query and update your ACL using SPARQL. That would probably
> require a meta-dataset containing ACL data for each secured dataset.

It definitely could (and in fact, we are doing pretty much what you
describe right now).
However, thinking in general terms - aren't there two levels to such
an ACL solution:

1) ACL is treated as completely external to Jena/Fuseki: Something
else is responsible for providing the "allow list" of graphs. (And:
Ideally there is no hard requirement to require a Java integration to
use the feature.)
2) ACL is enabled by storing rules in a specific graph in a Jena
dataset (and there I agree WAC seems very sensible - as you've
linked).

I'm querying about (1) where Jena/Fuseki is not necessarily the centre
of the picture, but part of multiple components.

>
> As it happens we have an authorization request filter for Jersey that
> checks WAC access using SPARQL:
> https://github.com/AtomGraph/LinkedDataHub/blob/master/src/main/java/com/atomgraph/linkeddatahub/server/filter/request/AuthorizationFilter.java
> The SPARQL query:
> https://github.com/AtomGraph/LinkedDataHub/blob/master/src/main/webapp/WEB-INF/web.xml#L25
>
> [1] https://www.w3.org/wiki/WebAccessControl
>
> >
> > >
> > > On Thu, 28 Jul 2022 at 22.51, Vilnis Termanis
> > > <Vi...@iotics.com.invalid> wrote:
> > >
> > > > Hi Andy & Jena development community,
> > > >
> > > > (Answers inline - apologies if I repeat myself)
> > > >
> > > > FYI - Our aim is to enable end-users to make SPARQL queries whilst
> > > > respecting visibility restrictions.
> > > > I.e. users (indirectly) add sets of related triples to a dataset and
> > > > they can choose who has visibility (beyond themselves) over these,
> > > > either: Nobody, Everyone or a chosen set (which can be updated). Note
> > > > that this restriction is not by a specific subject or predicate.
> > > > (Although the sets of triples do have relationships - not all of them
> > > > are known in advance.)
> > > >
> > > > On Thu, 28 Jul 2022 at 10:43, Andy Seaborne <an...@apache.org> wrote:
> > > > >
> > > > > JENA-2339
> > > > > PR#1441
> > > > >
> > > > https://github.com/vtermanis/jena/blob/dynamic-graph-restriction-extension/MOVE_ME_DynamicACL_notes.md
> > > > >
> > > > > tl;dr:
> > > > >
> > > > > It is a different role for Fuseki.
> > > > >
> > > > > Fuseki execute the security but the setup and control is from a trusted
> > > > > external server on the request execution path.
> > > > >
> > > > > It assumes certain deployment environments to be safe.
> > > >
> > > > FYI - In our case this means that we have a "make SPARQL query" API
> > > > call. When received, the applicable user (our domain) is known and, in
> > > > the proposed PR, we can prepend the set of allowed graphs to the query
> > > > (which have been looked up prior to query execution, externally). The
> > > > end user has NO direct access to Fuseki itself.
> > > >
> > > > >
> > > > > My feeling is that we should make Fuseki configurable enough so that a
> > > > > downstream 3rd party can add their security solution that is suitable
> > > > > for their environment. But we should not incorporate a particular
> > > > > security solution that relies on the deployment environment.
> > > > >
> > > > > ----
> > > > >
> > > > > I've asked for more information about the claim on a performance
> > > > > motivator and some other background information.
> > > > >
> > > > > The usage patterns are not yet clear. The data is described as "a one
> > > > > graph per handful of subjects and their properties" and "100s of
> > > > > graphs". What the queries are is unstated.
> > > >
> > > > Right now, each graph has in the range of 300-500 triples (though the
> > > > amount depends on how much additional/domain-specific metadata
> > > > end-users choose to add) and the scale of deployed Fuseki datasets
> > > > range from having a few to ~6k graphs.
> > > > Since we'd like to allow end-users to run **any** queries they wish
> > > > (we enforce query timeouts), it's difficult to give concrete examples.
> > > > I can however say that TDB unionDefaultGraph mode is enabled (i.e.
> > > > most end-users won't choose to explicitly target a specific graph) and
> > > > that one of our representative "search" queries (which combines
> > > > GeoSPARQL + multiple explicit property matching across multiple
> > > > different subjects in a UNION + subsequent collection of mandatory &
> > > > optional fields) is between 20-40% faster than the current custom
> > > > solution.
> > > > (Note that we have also tried query re-writing to insert FROM/FROM
> > > > NAMED clauses - and that is very slow in comparison, presumably due to
> > > > the higher level filtering involved, unlike the quad filter herein.)
> > > >
> > > > >
> > > > > There is no characterisation of the queries being made. If we are
> > > > > talking about overheads, the cases of a few big queries and many small
> > > > > queries are different.
> > > >
> > > > (pasted from JENA-2339 ticket) - using a "SELECT {} 1" query, and
> > > > adding a certain set of graphs makes the queries on my laptop take:
> > > > ~600 graphs ~115ms
> > > > ~1500 graphs ~162ms
> > > > ~3k graphs ~240ms
> > > > ~6k graphs ~400ms
> > > >
> > > > >
> > > > > The scale looks small (less than a million triples -
> > > > > approximating as 100 graphs * 1000 triples). That makes the point about
> > > > > access to TDB hooks a bit redundant.
> > > >
> > > > The dataset I've tested this with has ~1.8M triples. That's not to say
> > > > this is the scale we're hoping to satisfy - that's just what I
> > > > tested with first. By redundant, do you mean an alternative approach
> > > > should be used for this scale?
> > > >
> > > > >
> > > > >
> > > > > There are distinguished users. A request from one of these users
> > > > > causes the set of visible graphs to be read from a comment at the start
> > > > > of the query text in the request.
> > > > >
> > > > > The use of large numbers of small named graphs to manage security
> > > > > settings looks to me like triple-level security.  I have already
> > > > > mentioned work "FMod_ABAC": (£job related) awhile back (2/Jan/2022). It
> > > > > is triple level attribute-based security.
> > > >
> > > > It could well be that I'm seeing the wrong solution for the feature
> > > > we're trying to support (that's the other reason for reaching out to
> > > > the community). The reason (rightly or wrongly) to model this as a set
> > > > of graphs is: Each set of triples to be restricted are related, but
> > > > span multiple subjects and could also relate to other subjects in
> > > > other sets (as well as externally).
> > > > Hence I couldn't see how e.g. Jena Permissions could be applied here:
> > > > When you're provided with a single triple to check - you would have to
> > > > understand what type of subject it is and how it relates to the "top
> > > > level" subject to which the ACL applies. Bundling everything into a
> > > > graph seemed like a viable option.
> > > >
> > > > >
> > > > > Concern 1:
> > > > >
> > > > > This bypasses Fuseki-provided security and puts the control function
> > > > > outside the Fuseki server in a separate server that is not part of Jena.
> > > > > It will only be secure if deployed in a constrained network environment.
> > > > >
> > > > > This is not secure except when run in a certain way and, personally, I
> > > > > don't want to have to deal with a CVE because of that. CVE handling is
> > > > > time consuming.
> > > > >
> > > > > I don't see why it is using jena-access (the named graph security
> > > > > feature) except for the filtering on TDB. It is creating a dynamic
> > > > > dataset for the query.
> > > >
> > > > You're right - it's only as secure as the middleware/proxy/whatever in
> > > > front of it which supplies the ACL. (This was never intended to be
> > > > used/exposed to end-users directly.)
> > > > The purpose of extending jena-access (instead of immediately writing
> > > > it as a separate module) was to illustrate with minimal code changes
> > > > (+ extension of existing tests) what it could look like, for
> > > > discussion. (The quad filtering / performance aspect would be the
> > > > same, regardless of location, I presume.)
> > > >
> > > > >
> > > > > Concern 2: How does update fit into the picture? (GSP is not supported).
> > > >
> > > > I thought that, since GSP operations target a single graph, there is
> > > > no need to extend support to it since it's already possible to
> > > > restrict visibility (with the graph query parameter). Am I missing
> > > > something?
> > > >
> > > > >
> > > > > Concern 3: It looks like a specific solution for a specific scenario.
> > > > > Will it get uptake by the wide Jena user community?
> > > >
> > > > It's definitely specific. My thinking was that, if a subset of this
> > > > were deemed useful, then it'd be better to exist as part of the core
> > > > offering as opposed to us just bolting it on ourselves (at my job).
> > > > But, if that's not the case - fair enough.
> > > >
> > > > >
> > > > > Concern 4: Is there long-term support and maintenance for the feature?
> > > > > (e.g. 5y+)
> > > > > How do we respond to users@ message about it? Is it experimental code or
> > > > > has it been used for real? Is the feature set stable?
> > > >
> > > > My understanding is that jena-access is classed as stable (we're using
> > > > it for something else already in production) and thus, since this
> > > > merely produces a SecurityContext with a larger set of graphs, would
> > > > theoretically be no less stable.
> > > >
> > > > >
> > > > >
> > > > > Opinion: it is not unreasonable to provide support for this kind of
> > > > > customization of Fuseki.
> > > > >
> > > > > An extension can then provide whatever security is needed for the
> > > > > situation and it is the Fuseki user/operator making the decisions about
> > > > > what is acceptable security and what isn't.
> > > > >
> > > > > Fuseki has ways to add custom processors and this seems the way to
> > > > > provide an alternative way to make queries.
> > > > >
> > > > > Putting it in the distribution codebase is a big step for the project.
> > > > > At the very least, it needs to be mature and likely to be used.
> > > >
> > > > We wouldn't be reaching out if we weren't likely to want to use such a
> > > > feature. All these concerns/questions/suggestions are exactly what we
> > > > were hoping for. If I can provide any more context/tests/samples, let
> > > > me know.
> > > > (I completely get the concerns about diluting a known security feature
> > > > and have no issue with something like this being a separate
> > > > component.)
> > > >
> > > > >
> > > > > Background: Currently jena-access is in Fuseki main. It is not optional
> > > > > because it predates Fuseki modules.
> > > > >
> > > > >      Andy
> > > >



-- 
Vilnis Termanis
Technical Specialist

e | vilnis.termanis@iotics.com
www.iotics.com

The information contained in this email is strictly confidential and
intended only for the parties noted. If this email was not intended
for your use, please contact Iotics. For more on our Privacy Policy
please visit https://www.iotics.com/legal/

Re: About JENA-2339 - security related

Posted by Andy Seaborne <an...@apache.org>.


On 30/07/2022 21:14, Martynas Jusevičius wrote:
> On Fri, Jul 29, 2022 at 7:27 PM Vilnis Termanis
> <Vi...@iotics.com.invalid> wrote:
>>
>> (inline)
>>
>> On Fri, 29 Jul 2022 at 07:56, Martynas Jusevičius
>> <ma...@atomgraph.com> wrote:
>>>
>>> “Sets of triples” — aren’t these datasets?
>>>
>>> Couldn’t this use case be addressed by maintaining per-user datasets? Not
>>> sure if Fuseki can create datasets on the fly, but this seems like a much
>>> simpler feature to implement compared to a whole new ACL mechanism.
>>
>> The idea is, that if you had these "sets of triples" A-Z, one user
>> might be allowed to see A-M and another C-Q. With per-user datasets
>> you'd have to duplicate data to achieve that. And, when the ACL
>> changes, you'd have to copy/move triples from one dataset to another.
>> (Or am I missing a nuance to your proposal? Do you mean dynamically
>> creating a new dataset which references graphs from another dataset?)
> 
> No, not missing :)
> 
> I mean it sounds like a useful feature, and we could probably find use
> for it ourselves.
> 
> But if the ACL is graph-scoped, can't it employ an existing ontology
> such as WAC? [1]

The description of named graph (NG) usage so far does sound quite SOLID-like,
where data security on resources becomes API security (HTTP methods) due to
pods. I'm not clear how SOLID treats queries across pods, though, other than
in the style of Comunica fetching data as the query runs.

     Andy

> It would be eating your own dogfood, and of course it being RDF you
> could query and update your ACL using SPARQL. That would probably
> require a meta-dataset containing ACL data for each secured dataset.
> 
> As it happens we have an authorization request filter for Jersey that
> checks WAC access using SPARQL:
> https://github.com/AtomGraph/LinkedDataHub/blob/master/src/main/java/com/atomgraph/linkeddatahub/server/filter/request/AuthorizationFilter.java
> The SPARQL query:
> https://github.com/AtomGraph/LinkedDataHub/blob/master/src/main/webapp/WEB-INF/web.xml#L25
> 
> [1] https://www.w3.org/wiki/WebAccessControl
> 
>>
>>>
>>> On Thu, 28 Jul 2022 at 22.51, Vilnis Termanis
>>> <Vi...@iotics.com.invalid> wrote:
>>>
>>>> Hi Andy & Jena development community,
>>>>
>>>> (Answers inline - apologies if I repeat myself)
>>>>
>>>> FYI - Our aim is to enable end-users to make SPARQL queries whilst
>>>> respecting visibility restrictions.
>>>> I.e. users (indirectly) add sets of related triples to a dataset and
>>>> they can choose who has visibility (beyond themselves) over these,
>>>> either: Nobody, Everyone or a chosen set (which can be updated). Note
>>>> that this restriction is not by a specific subject or predicate.
>>>> (Although the sets of triples do have relationships - not all of them
>>>> are known in advance.)
>>>>
>>>> On Thu, 28 Jul 2022 at 10:43, Andy Seaborne <an...@apache.org> wrote:
>>>>>
>>>>> JENA-2339
>>>>> PR#1441
>>>>>
>>>> https://github.com/vtermanis/jena/blob/dynamic-graph-restriction-extension/MOVE_ME_DynamicACL_notes.md
>>>>>
>>>>> tl;dr:
>>>>>
>>>>> It is a different role for Fuseki.
>>>>>
>>>>> Fuseki executes the security but the setup and control is from a trusted
>>>>> external server on the request execution path.
>>>>>
>>>>> It assumes certain deployment environments to be safe.
>>>>
>>>> FYI - In our case this means that we have a "make SPARQL query" API
>>>> call. When received, the applicable user (our domain) is known and, in
>>>> the proposed PR, we can prepend the set of allowed graphs to the query
>>>> (which have been looked up prior to query execution, externally). The
>>>> end user has NO direct access to Fuseki itself.
>>>>
>>>>>
>>>>> My feeling is that we should make Fuseki configurable enough so that a
>>>>> downstream 3rd party can add their security solution that is suitable
>>>>> for their environment. But we should not incorporate a particular
>>>>> security solution that relies on the deployment environment.
>>>>>
>>>>> ----
>>>>>
>>>>> I've asked for more information about the claim on a performance
>>>>> motivator and some other background information.
>>>>>
>>>>> The usage patterns are not yet clear. The data is described as "a one
>>>>> graph per handful of subjects and their properties" and "100s of
>>>>> graphs". What the queries are is unstated.
>>>>
>>>> Right now, each graph has in the range of 300-500 triples (though the
>>>> amount depends on how much additional/domain-specific metadata
>>>> end-users choose to add) and the scale of deployed Fuseki datasets
>>>> range from having a few to ~6k graphs.
>>>> Since we'd like to allow end-users to run **any** queries they wish
>>>> (we enforce query timeouts), it's difficult to give concrete examples.
>>>> I can however say that TDB unionDefaultGraph mode is enabled (i.e.
>>>> most end-users won't choose to explicitly target a specific graph) and
>>>> that one of our representative "search" queries (which combines
>>>> GeoSPARQL + multiple explicit property matching across multiple
>>>> different subjects in a UNION + subsequent collection of mandatory &
>>>> optional fields) is between 20-40% faster than the current custom
>>>> solution.
>>>> (Note that we have also tried query re-writing to insert FROM/FROM
>>>> NAMED clauses - and that is very slow in comparison, presumably due to
>>>> the higher level filtering involved, unlike the quad filter herein.)
>>>>
>>>>>
>>>>> There is no characterisation of the queries being made. If we are
>>>>> talking about overheads, the cases of a few big queries and many small
>>>>> queries are different.
>>>>
>>>> (pasted from JENA-2339 ticket) - using a "SELECT {} 1" query, and
>>>> adding a certain set of graphs makes the queries on my laptop take:
>>>> ~600 graphs ~115ms
>>>> ~1500 graphs ~162ms
>>>> ~3k graphs ~240ms
>>>> ~6k graphs ~400ms
>>>>
>>>>>
>>>>> The scale looks small (less than a million triples -
>>>>> approximating as 100 graphs * 1000 triples). That makes the point about
>>>>> access to TDB hooks a bit redundant.
>>>>
>>>> The dataset I've tested this with has ~1.8M triples. That's not to say
>>>> this is the scale we're hoping to satisfy - that's just what I
>>>> tested with first. By redundant, do you mean an alternative approach
>>>> should be used for this scale?
>>>>
>>>>>
>>>>>
>>>>> There are distinguished users. A request from one of these users
>>>>> causes the set of visible graphs to be read from a comment at the start
>>>>> of the query text in the request.
>>>>>
>>>>> The use of large numbers of small named graphs to manage security
>>>>> settings looks to me like triple-level security.  I have already
>>>>> mentioned work "FMod_ABAC": (£job related) awhile back (2/Jan/2022). It
>>>>> is triple level attribute-based security.
>>>>
>>>> It could well be that I'm seeing the wrong solution for the feature
>>>> we're trying to support (that's the other reason for reaching out to
>>>> the community). The reason (rightly or wrongly) to model this as a set
>>>> of graphs is: Each set of triples to be restricted are related, but
>>>> span multiple subjects and could also relate to other subjects in
>>>> other sets (as well as externally).
>>>> Hence I couldn't see how e.g. Jena Permissions could be applied here:
>>>> When you're provided with a single triple to check - you would have to
>>>> understand what type of subject it is and how it relates to the "top
>>>> level" subject to which the ACL applies. Bundling everything into a
>>>> graph seemed like a viable option.
>>>>
>>>>>
>>>>> Concern 1:
>>>>>
>>>>> This bypasses Fuseki-provided security and puts the control function
>>>>> outside the Fuseki server in a separate server that is not part of Jena.
>>>>> It will only be secure if deployed in a constrained network environment.
>>>>>
>>>>> This is not secure except when run in a certain way and, personally, I
>>>>> don't want to have to deal with a CVE because of that. CVE handling is
>>>>> time consuming.
>>>>>
>>>>> I don't see why it is using jena-access (the named graph security
>>>>> feature) except for the filtering on TDB. It is creating a dynamic
>>>>> dataset for the query.
>>>>
>>>> You're right - it's only as secure as the middleware/proxy/whatever in
>>>> front of it which supplies the ACL. (This was never intended to be
>>>> used/exposed to end-users directly.)
>>>> The purpose of extending jena-access (instead of immediately writing
>>>> it as a separate module) was to illustrate with minimal code changes
>>>> (+ extension of existing tests) what it could look like, for
>>>> discussion. (The quad filtering / performance aspect would be the
>>>> same, regardless of location, I presume.)
>>>>
>>>>>
>>>>> Concern 2: How does update fit into the picture? (GSP is not supported).
>>>>
>>>> I thought that, since GSP operations target a single graph, there is
>>>> no need to extend support to it since it's already possible to
>>>> restrict visibility (with the graph query parameter). Am I missing
>>>> something?
>>>>
>>>>>
>>>>> Concern 3: It looks like a specific solution for a specific scenario.
>>>>> Will it get uptake by the wide Jena user community?
>>>>
>>>> It's definitely specific. My thinking was that, if a subset of this
>>>> were deemed useful, then it'd be better to exist as part of the core
>>>> offering as opposed to us just bolting it on ourselves (at my job).
>>>> But, if that's not the case - fair enough.
>>>>
>>>>>
>>>>> Concern 4: Is there long-term support and maintenance for the feature?
>>>>> (e.g. 5y+)
>>>>> How do we respond to users@ message about it? Is it experimental code or
>>>>> has it been used for real? Is the feature set stable?
>>>>
>>>> My understanding is that jena-access is classed as stable (we're using
>>>> it for something else already in production) and thus, since this
>>>> merely produces a SecurityContext with a larger set of graphs, would
>>>> theoretically be no less stable.
>>>>
>>>>>
>>>>>
>>>>> Opinion: it is not unreasonable to provide support for this kind of
>>>>> customization of Fuseki.
>>>>>
>>>>> An extension can then provide whatever security is needed for the
>>>>> situation and it is the Fuseki user/operator making the decisions about
>>>>> what is acceptable security and what isn't.
>>>>>
>>>>> Fuseki has ways to add custom processors and this seems the way to
>>>>> provide an alternative way to make queries.
>>>>>
>>>>> Putting it in the distribution codebase is a big step for the project.
>>>>> At the very least, it needs to be mature and likely to be used.
>>>>
>>>> We wouldn't be reaching out if we weren't likely to want to use such a
>>>> feature. All these concerns/questions/suggestions are exactly what we
>>>> were hoping for. If I can provide any more context/tests/samples, let
>>>> me know.
>>>> (I completely get the concerns about diluting a known security feature
>>>> and have no issue with something like this being a separate
>>>> component.)
>>>>
>>>>>
>>>>> Background: Currently jena-access is in Fuseki main. It is not optional
>>>>> because it predates Fuseki modules.
>>>>>
>>>>>       Andy
>>>>

Re: About JENA-2339 - security related

Posted by Martynas Jusevičius <ma...@atomgraph.com>.

On Fri, Jul 29, 2022 at 7:27 PM Vilnis Termanis
<Vi...@iotics.com.invalid> wrote:
>
> (inline)
>
> On Fri, 29 Jul 2022 at 07:56, Martynas Jusevičius
> <ma...@atomgraph.com> wrote:
> >
> > “Sets of triples” — aren’t these datasets?
> >
> > Couldn’t this use case be addressed by maintaining per-user datasets? Not
> > sure if Fuseki can create datasets on the fly, but this seems like a much
> > simpler feature to implement compared to a whole new ACL mechanism.
>
> The idea is, that if you had these "sets of triples" A-Z, one user
> might be allowed to see A-M and another C-Q. With per-user datasets
> you'd have to duplicate data to achieve that. And, when the ACL
> changes, you'd have to copy/move triples from one dataset to another.
> (Or am I missing a nuance to your proposal? Do you mean dynamically
> creating a new dataset which references graphs from another dataset?)

No, not missing :)

I mean it sounds like a useful feature, and we could probably find use
for it ourselves.

But if the ACL is graph-scoped, can't it employ an existing ontology
such as WAC? [1]
It would be eating your own dogfood, and of course it being RDF you
could query and update your ACL using SPARQL. That would probably
require a meta-dataset containing ACL data for each secured dataset.
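
As a rough illustration of what such a graph-scoped WAC check might
look like, here is a minimal Python sketch that only builds the text of
a SPARQL ASK query. It is not the LinkedDataHub query linked below. The
function name `wac_read_check` and the layout of the ACL data (one
acl:Authorization per graph, with the graph IRI as acl:accessTo) are
assumptions for illustration; the acl: terms themselves come from the
WAC vocabulary.

```python
ACL_NS = "http://www.w3.org/ns/auth/acl#"

# Characters that may not appear inside a SPARQL IRI reference (<...>).
_FORBIDDEN = '<>"{}|^`\\'


def wac_read_check(agent_iri, graph_iri):
    """Build a SPARQL ASK query: may agent_iri read graph_iri?

    Hypothetical helper; assumes the ACL meta-dataset describes each
    secured graph with an acl:Authorization resource.
    """
    for iri in (agent_iri, graph_iri):
        # Conservative check so a caller-supplied IRI cannot break out
        # of the <...> delimiters and inject query text.
        if any(ch in _FORBIDDEN or ord(ch) <= 0x20 for ch in iri):
            raise ValueError(f"unsafe IRI: {iri!r}")
    return (
        f"PREFIX acl: <{ACL_NS}>\n"
        "ASK {\n"
        "  ?auth a acl:Authorization ;\n"
        f"        acl:accessTo <{graph_iri}> ;\n"
        "        acl:mode acl:Read ;\n"
        f"        acl:agent <{agent_iri}> .\n"
        "}"
    )


query = wac_read_check("https://example.org/alice",
                       "https://example.org/graphs/A")
```

The resulting string would be sent to the meta-dataset holding the ACL
data, and the boolean answer would decide whether the graph is visible
to that agent.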

As it happens we have an authorization request filter for Jersey that
checks WAC access using SPARQL:
https://github.com/AtomGraph/LinkedDataHub/blob/master/src/main/java/com/atomgraph/linkeddatahub/server/filter/request/AuthorizationFilter.java
The SPARQL query:
https://github.com/AtomGraph/LinkedDataHub/blob/master/src/main/webapp/WEB-INF/web.xml#L25

[1] https://www.w3.org/wiki/WebAccessControl

>
> >
> > On Thu, 28 Jul 2022 at 22.51, Vilnis Termanis
> > <Vi...@iotics.com.invalid> wrote:
> >
> > > Hi Andy & Jena development community,
> > >
> > > (Answers inline - apologies if I repeat myself)
> > >
> > > FYI - Our aim is to enable end-users to make SPARQL queries whilst
> > > respecting visibility restrictions.
> > > I.e. users (indirectly) add sets of related triples to a dataset and
> > > they can choose who has visibility (beyond themselves) over these,
> > > either: Nobody, Everyone or a chosen set (which can be updated). Note
> > > that this restriction is not by a specific subject or predicate.
> > > (Although the sets of triples do have relationships - not all of them
> > > are known in advance.)
> > >
> > > On Thu, 28 Jul 2022 at 10:43, Andy Seaborne <an...@apache.org> wrote:
> > > >
> > > > JENA-2339
> > > > PR#1441
> > > >
> > > https://github.com/vtermanis/jena/blob/dynamic-graph-restriction-extension/MOVE_ME_DynamicACL_notes.md
> > > >
> > > > tl;dr:
> > > >
> > > > It is a different role for Fuseki.
> > > >
> > > > Fuseki executes the security but the setup and control is from a trusted
> > > > external server on the request execution path.
> > > >
> > > > It assumes certain deployment environments to be safe.
> > >
> > > FYI - In our case this means that we have a "make SPARQL query" API
> > > call. When received, the applicable user (our domain) is known and, in
> > > the proposed PR, we can prepend the set of allowed graphs to the query
> > > (which have been looked up prior to query execution, externally). The
> > > end user has NO direct access to Fuseki itself.
> > >
> > > >
> > > > My feeling is that we should make Fuseki configurable enough so that a
> > > > downstream 3rd party can add their security solution that is suitable
> > > > for their environment. But we should not incorporate a particular
> > > > security solution that relies on the deployment environment.
> > > >
> > > > ----
> > > >
> > > > I've asked for more information about the claim on a performance
> > > > motivator and some other background information.
> > > >
> > > > The usage patterns are not yet clear. The data is described as "a one
> > > > graph per handful of subjects and their properties" and "100s of
> > > > graphs". What the queries are is unstated.
> > >
> > > Right now, each graph has in the range of 300-500 triples (though the
> > > amount depends on how much additional/domain-specific metadata
> > > end-users choose to add) and the scale of deployed Fuseki datasets
> > > range from having a few to ~6k graphs.
> > > Since we'd like to allow end-users to run **any** queries they wish
> > > (we enforce query timeouts), it's difficult to give concrete examples.
> > > I can however say that TDB unionDefaultGraph mode is enabled (i.e.
> > > most end-users won't choose to explicitly target a specific graph) and
> > > that one of our representative "search" queries (which combines
> > > GeoSPARQL + multiple explicit property matching across multiple
> > > different subjects in a UNION + subsequent collection of mandatory &
> > > optional fields) is between 20-40% faster than the current custom
> > > solution.
> > > (Note that we have also tried query re-writing to insert FROM/FROM
> > > NAMED clauses - and that is very slow in comparison, presumably due to
> > > the higher level filtering involved, unlike the quad filter herein.)
> > >
> > > >
> > > > There is no characterisation of the queries being made. If we are
> > > > talking about overheads, the cases of a few big queries and many small
> > > > queries are different.
> > >
> > > (pasted from JENA-2339 ticket) - using a "SELECT {} 1" query, and
> > > adding a certain set of graphs makes the queries on my laptop take:
> > > ~600 graphs ~115ms
> > > ~1500 graphs ~162ms
> > > ~3k graphs ~240ms
> > > ~6k graphs ~400ms
> > >
> > > >
> > > > The scale looks small (less than a million triples -
> > > > approximating as 100 graphs * 1000 triples). That makes the point about
> > > > access to TDB hooks a bit redundant.
> > >
> > > The dataset I've tested this with has ~1.8M triples. That's not to say
> > > this is the scale we're hoping to satisfy - that's just what I
> > > tested with first. By redundant, do you mean an alternative approach
> > > should be used for this scale?
> > >
> > > >
> > > >
> > > > There are distinguished users. A request from one of these users
> > > > causes the set of visible graphs to be read from a comment at the start
> > > > of the query text in the request.
> > > >
> > > > The use of large numbers of small named graphs to manage security
> > > > settings looks to me like triple-level security.  I have already
> > > > mentioned work "FMod_ABAC": (£job related) awhile back (2/Jan/2022). It
> > > > is triple level attribute-based security.
> > >
> > > It could well be that I'm seeing the wrong solution for the feature
> > > we're trying to support (that's the other reason for reaching out to
> > > the community). The reason (rightly or wrongly) to model this as a set
> > > of graphs is: Each set of triples to be restricted are related, but
> > > span multiple subjects and could also relate to other subjects in
> > > other sets (as well as externally).
> > > Hence I couldn't see how e.g. Jena Permissions could be applied here:
> > > When you're provided with a single triple to check - you would have to
> > > understand what type of subject it is and how it relates to the "top
> > > level" subject to which the ACL applies. Bundling everything into a
> > > graph seemed like a viable option.
> > >
> > > >
> > > > Concern 1:
> > > >
> > > > This bypasses Fuseki-provided security and puts the control function
> > > > outside the Fuseki server in a separate server that is not part of Jena.
> > > > It will only be secure if deployed in a constrained network environment.
> > > >
> > > > This is not secure except when run in a certain way and, personally, I
> > > > don't want to have to deal with a CVE because of that. CVE handling is
> > > > time consuming.
> > > >
> > > > I don't see why it is using jena-access (the named graph security
> > > > feature) except for the filtering on TDB. It is creating a dynamic
> > > > dataset for the query.
> > >
> > > You're right - it's only as secure as the middleware/proxy/whatever in
> > > front of it which supplies the ACL. (This was never intended to be
> > > used/exposed to end-users directly.)
> > > The purpose of extending jena-access (instead of immediately writing
> > > it as a separate module) was to illustrate with minimal code changes
> > > (+ extension of existing tests) what it could look like, for
> > > discussion. (The quad filtering / performance aspect would be the
> > > same, regardless of location, I presume.)
> > >
> > > >
> > > > Concern 2: How does update fit into the picture? (GSP is not supported).
> > >
> > > I thought that, since GSP operations target a single graph, there is
> > > no need to extend support to it since it's already possible to
> > > restrict visibility (with the graph query parameter). Am I missing
> > > something?
> > >
> > > >
> > > > Concern 3: It looks like a specific solution for a specific scenario.
> > > > Will it get uptake by the wide Jena user community?
> > >
> > > It's definitely specific. My thinking was that, if a subset of this
> > > were deemed useful, then it'd be better to exist as part of the core
> > > offering as opposed to us just bolting it on ourselves (at my job).
> > > But, if that's not the case - fair enough.
> > >
> > > >
> > > > Concern 4: Is there long-term support and maintenance for the feature?
> > > > (e.g. 5y+)
> > > > How do we respond to users@ message about it? Is it experimental code or
> > > > has it been used for real? Is the feature set stable?
> > >
> > > My understanding is that jena-access is classed as stable (we're using
> > > it for something else already in production) and thus, since this
> > > merely produces a SecurityContext with a larger set of graphs, would
> > > theoretically be no less stable.
> > >
> > > >
> > > >
> > > > Opinion: it is not unreasonable to provide support for this kind of
> > > > customization of Fuseki.
> > > >
> > > > An extension can then provide whatever security is needed for the
> > > > situation and it is the Fuseki user/operator making the decisions about
> > > > what is acceptable security and what isn't.
> > > >
> > > > Fuseki has ways to add custom processors and this seems the way to
> > > > provide an alternative way to make queries.
> > > >
> > > > Putting it in the distribution codebase is a big step for the project.
> > > > At the very least, it needs to be mature and likely to be used.
> > >
> > > We wouldn't be reaching out if we weren't likely to want to use such a
> > > feature. All these concerns/questions/suggestions are exactly what we
> > > were hoping for. If I can provide any more context/tests/samples, let
> > > me know.
> > > (I completely get the concerns about diluting a known security feature
> > > and have no issue with something like this being a separate
> > > component.)
> > >
> > > >
> > > > Background: Currently jena-access is in Fuseki main. It is not optional
> > > > because it predates Fuseki modules.
> > > >
> > > >      Andy
> > >

Re: About JENA-2339 - security related

Posted by Vilnis Termanis <Vi...@iotics.com.INVALID>.

(inline)

On Fri, 29 Jul 2022 at 07:56, Martynas Jusevičius
<ma...@atomgraph.com> wrote:
>
> “Sets of triples” — aren’t these datasets?
>
> Couldn’t this use case be addressed by maintaining per-user datasets? Not
> sure if Fuseki can create datasets on the fly, but this seems like a much
> simpler feature to implement compared to a whole new ACL mechanism.

The idea is that, if you had these "sets of triples" A-Z, one user
might be allowed to see A-M and another C-Q. With per-user datasets
you'd have to duplicate data to achieve that. And, when the ACL
changes, you'd have to copy/move triples from one dataset to another.
(Or am I missing a nuance to your proposal? Do you mean dynamically
creating a new dataset which references graphs from another dataset?)
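
To make the trade-off concrete, here is a minimal Python sketch (not
Jena code). Single letters stand in for named graphs, and the names
`GRAPHS`, `ACL`, `visible_graphs` and `per_user_datasets` are
hypothetical. It contrasts one shared dataset plus per-user allow-lists
with per-user copies of the data.

```python
import string

# Hypothetical stand-ins: 26 named graphs "A".."Z", each holding one triple.
GRAPHS = {name: {(f"s{name}", "p", f"o{name}")}
          for name in string.ascii_uppercase}


def letters(first, last):
    """Return the set of graph names from first to last, inclusive."""
    return {chr(c) for c in range(ord(first), ord(last) + 1)}


# Approach 1: one shared dataset plus a per-user allow-list of graph names.
ACL = {"user1": letters("A", "M"), "user2": letters("C", "Q")}


def visible_graphs(user):
    """Graphs the user may see; nothing is copied, only filtered."""
    allowed = ACL.get(user, set())
    return {name: triples for name, triples in GRAPHS.items()
            if name in allowed}


# Approach 2: per-user datasets, which means copying every visible graph.
def per_user_datasets():
    return {user: {name: set(GRAPHS[name]) for name in allowed}
            for user, allowed in ACL.items()}


shared = ACL["user1"] & ACL["user2"]            # graphs C..M, seen by both
distinct = len(ACL["user1"] | ACL["user2"])     # graphs A..Q
stored_graphs = sum(len(ds) for ds in per_user_datasets().values())
```

In this toy model the per-user layout stores 28 graph copies for 17
distinct graphs, because C-M are held twice. Changing an allow-list in
the shared layout only edits `ACL` and moves no triples.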




-- 
Vilnis Termanis
Technical Specialist

e | vilnis.termanis@iotics.com
www.iotics.com

The information contained in this email is strictly confidential and
intended only for the parties noted. If this email was not intended
for your use, please contact Iotics. For more on our Privacy Policy
please visit https://www.iotics.com/legal/

Re: About JENA-2339 - security related

Posted by Andy Seaborne <an...@apache.org>.


On 29/07/2022 07:56, Martynas Jusevičius wrote:
> “Sets of triples” — aren’t these datasets?
> 
> Couldn’t this use case be addressed by maintaining per-user datasets? Not
> sure if Fuseki can create datasets on the fly,

Yes, it can - and remove them. That's what the UI does for "create 
dataset", "delete dataset".

Dispatch is dynamic, based on a registry, not fixed, like it would be if 
web.xml were used.
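
As a rough illustration of doing the same thing without the UI, the sketch below builds (but does not send) the HTTP requests involved. It assumes a Fuseki server with the admin endpoints enabled at `http://localhost:3030`, and that dataset creation is a form POST to `/$/datasets` with `dbName` and `dbType` parameters; check both against the deployed version. The dataset name `user-42` is made up.

```python
from urllib.parse import urlencode
from urllib.request import Request

# Assumption: a Fuseki server with the admin endpoints enabled at this address.
FUSEKI = "http://localhost:3030"

def create_dataset_request(name, db_type="mem"):
    """Build (but do not send) the admin request that creates a dataset."""
    body = urlencode({"dbName": name, "dbType": db_type}).encode("ascii")
    return Request(f"{FUSEKI}/$/datasets", data=body, method="POST")

def delete_dataset_request(name):
    """Build (but do not send) the admin request that removes a dataset."""
    return Request(f"{FUSEKI}/$/datasets/{name}", method="DELETE")

create = create_dataset_request("user-42")
delete = delete_dataset_request("user-42")
print(create.get_method(), create.full_url, create.data)
print(delete.get_method(), delete.full_url)
```

Passing either request to `urllib.request.urlopen` would perform the call; that step is left out so the sketch has no side effects.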

     Andy

> but this seems like a much
> simpler feature to implement compared to a whole new ACL mechanism.

Re: About JENA-2339 - security related

Posted by Martynas Jusevičius <ma...@atomgraph.com>.

“Sets of triples” — aren’t these datasets?

Couldn’t this use case be addressed by maintaining per-user datasets? Not
sure if Fuseki can create datasets on the fly, but this seems like a much
simpler feature to implement compared to a whole new ACL mechanism.


Re: About JENA-2339 - security related

Posted by Vilnis Termanis <Vi...@iotics.com.INVALID>.

Hi Andy & Jena development community,

(Answers inline - apologies if I repeat myself)

FYI - Our aim is to enable end-users to make SPARQL queries whilst
respecting visibility restrictions.
I.e. users (indirectly) add sets of related triples to a dataset and
they can choose who has visibility (beyond themselves) over these,
either: Nobody, Everyone or a chosen set (which can be updated). Note
that this restriction is not by a specific subject or predicate.
(Although the sets of triples do have relationships - not all of them
are known in advance.)

On Thu, 28 Jul 2022 at 10:43, Andy Seaborne <an...@apache.org> wrote:
>
> JENA-2339
> PR#1441
> https://github.com/vtermanis/jena/blob/dynamic-graph-restriction-extension/MOVE_ME_DynamicACL_notes.md
>
> tl;dr:
>
> It is a different role for Fuseki.
>
> Fuseki executes the security but the setup and control are from a trusted
> external server on the request execution path.
>
> It assumes certain deployment environments to be safe.

FYI - In our case this means that we have a "make SPARQL query" API
call. When received, the applicable user (our domain) is known and, in
the proposed PR, we can prepend the set of allowed graphs to the query
(which have been looked up prior to query execution, externally). The
end user has NO direct access to Fuseki itself.
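
A minimal sketch of that middleware step, in Python for brevity. The `#ALLOWED-GRAPHS:` marker and the function name are hypothetical; the PR notes define the real comment format. The point is only that the trusted caller, not the end user, writes the first line of the query text.

```python
def with_allowed_graphs(query, allowed_graphs):
    """Prefix a SPARQL query with a comment listing the graphs a user may see.

    The "#ALLOWED-GRAPHS:" marker is hypothetical, not the format used by the PR.
    """
    for g in allowed_graphs:
        # A line break inside an IRI would end the comment early and let the
        # rest of the "IRI" be parsed as query text, so reject it outright.
        if any(c in g for c in "\r\n <>"):
            raise ValueError(f"unsafe graph IRI: {g!r}")
    header = "#ALLOWED-GRAPHS: " + " ".join(f"<{g}>" for g in sorted(allowed_graphs))
    return header + "\n" + query

# The graph list would come from the external ACL lookup for the calling user.
user_query = "SELECT * { ?s ?p ?o } LIMIT 10"
print(with_allowed_graphs(user_query, ["http://example/g2", "http://example/g1"]))
```

Because the restriction travels inside the request itself, anything that can reach Fuseki directly can write its own header, which is why the end user must never have direct access.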

>
> My feeling is that we should make Fuseki configurable enough so that a
> downstream 3rd party can add their security solution that is suitable
> for their environment. But we should not incorporate a particular
> security solution that relies on the deployment environment.
>
> ----
>
> I've asked for more information about the claim on a performance
> motivator and some other background information.
>
> The usage patterns are not yet clear. The data is described as "a one
> graph per handful of subjects and their properties" and "100s of
> graphs". What the queries are is unstated.

Right now, each graph has in the range of 300-500 triples (though the
amount depends on how much additional/domain-specific metadata
end-users choose to add) and deployed Fuseki datasets range in scale
from a few graphs to ~6k graphs.
Since we'd like to allow end-users to run **any** queries they wish
(we enforce query timeouts), it's difficult to give concrete examples.
I can however say that TDB unionDefaultGraph mode is enabled (i.e.
most end-users won't choose to explicitly target a specific graph) and
that one of our representative "search" queries (which combines
GeoSPARQL + multiple explicit property matching across multiple
different subjects in a UNION + subsequent collection of mandatory &
optional fields) is between 20-40% faster than the current custom
solution.
(Note that we have also tried query re-writing to insert FROM/FROM
NAMED clauses - and that is very slow in comparison, presumably due to
the higher-level filtering involved, unlike the quad filter herein.)
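
For reference, the rewriting alternative looks roughly like the sketch below. It is a deliberately naive string builder with a made-up function name; real rewriting should go through a SPARQL parser. Each allowed graph gets a FROM clause (so it is merged into the query's default graph) and a FROM NAMED clause (so GRAPH patterns can still see it).

```python
def select_with_dataset(projection, pattern, graphs):
    """Build a SELECT query whose dataset is restricted to the given graphs."""
    clauses = []
    for g in sorted(graphs):
        clauses.append(f"FROM <{g}>")        # merged into the default graph
        clauses.append(f"FROM NAMED <{g}>")  # available to GRAPH patterns
    return "SELECT {}\n{}\nWHERE {{ {} }}".format(projection, "\n".join(clauses), pattern)

query = select_with_dataset("*", "?s ?p ?o", ["http://example/g1", "http://example/g2"])
print(query)
```

The query text grows by two clauses per allowed graph, so a user who can see ~6k graphs gets roughly 12k dataset clauses on every request.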

>
> There is no characterisation of the queries being made. If we are
> talking about overheads, the cases of a few big queries and many small
> queries are different.

(pasted from JENA-2339 ticket) - using a "SELECT {} 1" query, and
adding a certain set of graphs makes the queries on my laptop take:
~600 graphs ~115ms
~1500 graphs ~162ms
~3k graphs ~240ms
~6k graphs ~400ms
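
A quick way to read those numbers is to look at the cost of each additional graph between successive measurements. The sketch below only does arithmetic on the four approximate figures above; nothing else is measured.

```python
# (graphs, milliseconds) as reported above; all values are approximate.
timings = [(600, 115), (1500, 162), (3000, 240), (6000, 400)]

slopes = []
for (g0, t0), (g1, t1) in zip(timings, timings[1:]):
    per_graph_us = (t1 - t0) / (g1 - g0) * 1000  # microseconds per extra graph
    slopes.append(per_graph_us)
    print(f"{g0:>4} -> {g1:>4} graphs: ~{per_graph_us:.0f} us per additional graph")
```

The three intervals come out at roughly 52-53 microseconds per graph, i.e. the overhead grows about linearly with the number of allowed graphs, on top of a fixed cost in the region of 80-85 ms.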

>
> The scale looks small (less than a million triples -
> approximating as 100 graphs * 1000 triples). That makes the point about
> access to TDB hooks a bit redundant.

The dataset I've tested this with has ~1.8M triples. That's not to say
this is the scale we're hoping to satisfy - that's just what I
tested with first. By redundant, do you mean an alternative approach
should be used for this scale?

>
>
> There are distinguished users. A request from one of these users
> causes the set of visible graphs to be read from a comment at the start
> of the query text in the request.
>
> The use of large numbers of small named graphs to manage security
> settings looks to me like triple-level security.  I have already
> mentioned work "FMod_ABAC": (£job related) a while back (2/Jan/2022). It
> is triple-level attribute-based security.

It could well be that I'm seeing the wrong solution for the feature
we're trying to support (that's the other reason for reaching out to
the community). The reason (rightly or wrongly) to model this as a set
of graphs is: the triples in each restricted set are related, but they
span multiple subjects and could also relate to subjects in other sets
(as well as externally).
Hence I couldn't see how e.g. Jena Permissions could be applied here:
when you're given a single triple to check, you would have to
understand what type of subject it is and how it relates to the "top
level" subject to which the ACL applies. Bundling everything into a
graph seemed like a viable option.
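
The difference can be sketched as follows. This is plain Python with a made-up `Quad` type and made-up IRIs, not the jena-access API: once each restricted set lives in its own graph, the visibility check needs only the graph name and never has to work out which "top level" subject a triple belongs to.

```python
from typing import NamedTuple

class Quad(NamedTuple):
    graph: str
    subject: str
    predicate: str
    obj: str

def make_quad_filter(allowed_graphs):
    """Return a predicate that accepts quads from the allowed graphs only."""
    allowed = frozenset(allowed_graphs)
    def accept(quad):
        # Graph-level check: subject, predicate and object are never inspected.
        return quad.graph in allowed
    return accept

quads = [
    Quad("urn:g:set1", "urn:s:twin", "urn:p:hasFeed", "urn:s:feed"),
    Quad("urn:g:set1", "urn:s:feed", "urn:p:label", "temperature"),
    Quad("urn:g:set2", "urn:s:other", "urn:p:relatedTo", "urn:s:twin"),
]

accept = make_quad_filter({"urn:g:set1"})
visible_quads = [q for q in quads if accept(q)]
print(len(visible_quads), "of", len(quads), "quads visible")
```

Both subjects in `urn:g:set1` pass with one set lookup each, while the quad in `urn:g:set2` is hidden even though it refers to a subject from the first set.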

>
> Concern 1:
>
> This bypasses Fuseki-provided security and puts the control function
> outside the Fuseki server in a separate server that is not part of Jena.
> It will only be secure if deployed in a constrained network environment.
>
> This is not secure except when run in a certain way and, personally, I
> don't want to have to deal with a CVE because of that. CVE handling is
> time consuming.
>
> I don't see why it is using jena-access (the named graph security
> feature) except for the filtering on TDB. It is creating a dynamic
> dataset for the query.

You're right - it's only as secure as the middleware/proxy/whatever in
front of it which supplies the ACL. (This was never intended to be
used/exposed to end-users directly.)
The purpose of extending jena-access (instead of immediately writing
it as a separate module) was to illustrate with minimal code changes
(+ extension of existing tests) what it could look like, for
discussion. (The quad filtering / performance aspect would be the
same, regardless of location, I presume.)
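
For discussion, the trust assumption can be sketched in plain Python
(not Jena code): a graph list carried in a comment at the start of the
query is honoured only when the request comes from a designated trusted
user, i.e. the middleware. The comment format, the user names and the
`graphs_from_query` helper below are hypothetical and are not taken from
the PR.

```python
import re
from typing import Optional, Set

# Hypothetical comment format; the real format used in the PR may differ.
ACL_COMMENT = re.compile(r"^#\s*visible-graphs:\s*(.*)$")


def graphs_from_query(query: str, user: str, trusted_users: Set[str]) -> Optional[Set[str]]:
    """Return the graph set from a leading comment, or None if it must be ignored."""
    if user not in trusted_users:
        return None  # only the trusted middleware may supply an ACL this way
    lines = query.splitlines()
    if not lines:
        return None
    match = ACL_COMMENT.match(lines[0].strip())
    if match is None:
        return None
    return set(match.group(1).split())


query = "# visible-graphs: urn:g:1 urn:g:2\nSELECT * { GRAPH ?g { ?s ?p ?o } }"
from_proxy = graphs_from_query(query, "proxy", {"proxy"})
from_other = graphs_from_query(query, "alice", {"proxy"})
```

In this sketch the same query text yields a graph set for the trusted
"proxy" user and nothing for anyone else, which is why the whole scheme
is only as secure as whatever sits in front of it.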

>
> Concern 2: How does update fit into the picture? (GSP is not supported).

I thought that, since GSP operations target a single graph, there is
no need to extend support to it since it's already possible to
restrict visibility (with the graph query parameter). Am I missing
something?

>
> Concern 3: It looks like a specific solution for a specific scenario.
> Will it get uptake by the wide Jena user community?

It's definitely specific. My thinking was that, if a subset of this
were deemed useful, then it'd be better to exist as part of the core
offering as opposed to us just bolting it on ourselves (at my job).
But, if that's not the case - fair enough.

>
> Concern 4: Is there long-term support and maintenance for the feature?
> (e.g. 5y+)
> How do we respond to users@ message about it? Is it experimental code or
> has it been used for real? Is the feature set stable?

My understanding is that jena-access is classed as stable (we're using
it for something else already in production). Since this change merely
produces a SecurityContext with a larger set of graphs, it would
theoretically be no less stable.

>
>
> Opinion: it is not unreasonable to provide support for this kind of
> customization of Fuseki.
>
> An extension can then provide whatever security is needed for the
> situation and it is the Fuseki user/operator making the decisions about
> what is acceptable security and what isn't.
>
> Fuseki has ways to add custom processors and this seems the way to
> provide an alternative way to make queries.
>
> Putting it in the distribution codebase is a big step for the project.
> At the very least, it needs to be mature and likely to be used.

We wouldn't be reaching out if we weren't likely to want to use such a
feature. All these concerns/questions/suggestions are exactly what we
were hoping for. If I can provide any more context/tests/samples, let
me know.
(I completely get the concerns about diluting a known security feature
and have no issue with something like this being a separate
component.)

>
> Background: Currently jena-access is in Fuseki main. It is not optional
> because it predates Fuseki modules.
>
>      Andy

-- 
Vilnis Termanis
Technical Specialist

e | vilnis.termanis@iotics.com
www.iotics.com