Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2015/08/02 21:05:47 UTC

RDFConnection

Stephen, all,

Recently on users@ there was a question about the s-* tools in Java. That got 
me thinking about an interface to pull together all SPARQL operations 
into one application-facing place.  We have jena-jdbc and jena-client 
already - this is my sketch take.

[1] RDFConnection

Currently, it's a sketch-for-discussion; it's a bit DatasetAccessor-like 
+ SPARQL query + SPARQL Update.  And some whole-dataset-REST-ish 
operations (that Fuseki happens to support).  It's a chance to redo 
things a bit.

RDFConnection uses the existing SPARQL+RDF classes and abstractions in 
ARQ, not strings [*], rather than putting all app-visible classes in one 
package.

Adding an equivalent of DatabaseClient to represent one place would be 
good - and add the admin operations, for Fuseki at least.  Also, a 
streaming load possibility.

Comments?
Specific use cases?

	Andy

(multi-operation transactions ... later!)

[*] You can use strings as well - that's the way to get arbitrary 
non-standard extensions through.

[1] 
https://github.com/afs/AFS-Dev/blob/master/src/main/java/projects/rdfconnection/RDFConnection.java
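To make the "one application-facing place" idea concrete, here is a toy sketch in plain JDK Java. Everything here is an illustrative stand-in, not the actual draft at [1]: the real sketch works with Jena's Query, Model and UpdateRequest types, while strings are used below so the example is self-contained.

```java
import java.util.function.Consumer;

// Illustrative stand-in for an application-facing connection that gathers
// query, update and GSP-style operations behind one interface.
interface SketchConnection extends AutoCloseable {
    void querySelect(String sparql, Consumer<String> rowAction); // SPARQL Query
    void update(String sparql);                                  // SPARQL Update
    void load(String graphName, String file);                    // GSP-style op
    @Override void close();                                      // no checked exception
}

public class SketchConnectionDemo {
    static String run() {
        StringBuilder log = new StringBuilder();
        // A fake in-memory implementation, standing in for an HTTP-backed one.
        try (SketchConnection conn = new SketchConnection() {
            public void querySelect(String sparql, Consumer<String> rowAction) {
                log.append("query;");
                rowAction.accept("row1");
            }
            public void update(String sparql) { log.append("update;"); }
            public void load(String graph, String file) { log.append("load;"); }
            public void close() { log.append("close;"); }
        }) {
            conn.update("INSERT DATA { <s> <p> <o> }");
            conn.querySelect("SELECT * { ?s ?p ?o }", row -> log.append(row).append(";"));
        }
        return log.toString();
    }

    public static void main(String[] args) {
        System.out.println(run()); // update;query;row1;close;
    }
}
```

The point is only the shape: one AutoCloseable object through which all SPARQL operations flow, so resource and (later) transaction handling have a single home.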

Re: RDFConnection

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.
Ah, that makes my distinction pretty meaningless! This abstraction seems meant to rub out just such differences.

This does remind me of another potentially nice small feature: a Stream<Triple> construct(Query query) method, maybe at first via QueryExecution::execConstructTriples. The Stream could pass through the AutoCloseable-ity of the underlying QueryExecution. With a clever implementation, eventually some of the methods on Stream (e.g. filter) could get passed through to SPARQL execution.
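The close-through part of this idea can be sketched with plain JDK types: Stream.onClose is the hook that lets closing the Stream close the execution behind it. FakeExecution below is a hypothetical stand-in for QueryExecution.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.stream.Stream;

// Sketch: a construct-style method returning a Stream whose close()
// propagates to the underlying (stand-in) query execution.
public class StreamCloseDemo {
    static class FakeExecution implements AutoCloseable {
        final AtomicBoolean closed = new AtomicBoolean(false);

        // the hypothetical Stream<Triple> construct(...) shape
        Stream<String> execConstructTriples() {
            // onClose chains the Stream's AutoCloseable-ity to ours
            return Stream.of("t1", "t2", "t3").onClose(this::close);
        }

        @Override public void close() { closed.set(true); }
    }

    static boolean run() {
        FakeExecution exec = new FakeExecution();
        try (Stream<String> triples = exec.execConstructTriples()) {
            if (triples.count() != 3)
                throw new IllegalStateException("expected 3 triples");
        } // closing the Stream here also closes the execution
        return exec.closed.get();
    }

    public static void main(String[] args) {
        System.out.println(run()); // true
    }
}
```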

---
A. Soroka
The University of Virginia Library

On Aug 5, 2015, at 9:37 AM, Rob Vesse <rv...@dotnetrdf.org> wrote:

> The main complicating factor is that clear and delete are only separate
> operations if the storage layer stores graph names separately from graph
> data which the SPARQL specification specifically do not require
> 
> For storage systems like TDB where only quads are stored the existence of
> a named graph is predicated by the existence of some quads in that graph
> and so delete is equivalent to clear because if you remove all quads for a
> graph TDB doesn't know about that graph any more
> 
> The SPARQL specifications actually explicitly call this complication out
> in several places (search for empty graphs in the SPARQL 1.1 update spec)
> and various SPARQL Updates behaviours may differ depending on whether the
> storage layer records the presence of empty graphs or not
> 
> Rob


Re: RDFConnection

Posted by Rob Vesse <rv...@dotnetrdf.org>.
The main complicating factor is that clear and delete are only separate
operations if the storage layer stores graph names separately from graph
data, which the SPARQL specification specifically does not require.

For storage systems like TDB where only quads are stored, the existence of
a named graph is predicated on the existence of some quads in that graph,
and so delete is equivalent to clear: if you remove all quads for a
graph, TDB doesn't know about that graph any more.

The SPARQL specifications actually call this complication out explicitly
in several places (search for empty graphs in the SPARQL 1.1 Update spec),
and various SPARQL Update behaviours may differ depending on whether the
storage layer records the presence of empty graphs or not.
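This point can be made concrete with a toy quad-only store (illustrative Java only, not TDB code): when a graph exists only by virtue of its quads, CLEAR and DROP are observably identical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy quad-only store: no separate registry of graph names exists,
// so a graph "exists" only while it holds at least one triple.
public class QuadStoreDemo {
    final Map<String, Set<String>> store = new HashMap<>(); // graph -> triples

    void add(String graph, String triple) {
        store.computeIfAbsent(graph, g -> new HashSet<>()).add(triple);
    }

    void clear(String graph) {
        // removing every quad removes the only evidence the graph existed
        store.remove(graph);
    }

    boolean graphExists(String graph) {
        return store.containsKey(graph);
    }

    static boolean run() {
        QuadStoreDemo tdbLike = new QuadStoreDemo();
        tdbLike.add("g1", ":s :p :o");
        tdbLike.clear("g1");              // CLEAR GRAPH <g1>
        return tdbLike.graphExists("g1"); // false - same effect as DROP GRAPH
    }

    public static void main(String[] args) {
        System.out.println(run()); // false
    }
}
```

A store that kept a separate graph-name registry could distinguish the two: clear would leave an empty graph behind, delete would remove the name as well.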

Rob

On 05/08/2015 13:44, "ajs6f@virginia.edu" <aj...@virginia.edu> wrote:

> Just a thought on "ergonomics": it might be nice to separate "clear" and
> "delete", so instead of RDFConnection::delete either clearing or deleting
> a graph depending on whether it is the default graph, you have finer
> control and can clear a non-default graph.
>
> ---
> A. Soroka
> The University of Virginia Library





Re: RDFConnection

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.
Just a thought on "ergonomics": it might be nice to separate "clear" and "delete", so instead of RDFConnection::delete either clearing or deleting a graph depending on whether it is the default graph, you have finer control and can clear a non-default graph.

---
A. Soroka
The University of Virginia Library

On Aug 4, 2015, at 6:21 PM, Andy Seaborne <an...@apache.org> wrote:

> There's a note in the interface
> 
>    // ---- Query
>    // Maybe more query forms: querySelect(Query)? select(Query)?
> 
> At the moment, the operations are the basic ones (the SPARQL protocols for query, update and GSP).  There's scope to add forms on top.
> 
>  void execSelect(Query query, Consumer<QuerySolution> action)
> 
> is one possibility.
> 
> 	Andy


Re: RDFConnection

Posted by Andy Seaborne <an...@apache.org>.
There's a note in the interface

     // ---- Query
     // Maybe more query forms: querySelect(Query)? select(Query)?

At the moment, the operations are the basic ones (the SPARQL protocols 
for query, update and GSP).  There's scope to add forms on top.

   void execSelect(Query query, Consumer<QuerySolution> action)

is one possibility.
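The appeal of that Consumer-based form can be sketched with plain JDK types (strings stand in for Query and QuerySolution here): iteration happens inside the connection, so the connection, not the caller, owns the lifetime of the result resources.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch of the push-style execSelect shape. The caller supplies only the
// per-row action; the connection controls opening and closing of results.
public class ExecSelectDemo {
    static void execSelect(String query, Consumer<String> action) {
        // the real thing would open a QueryExecution in try-with-resources here
        List<String> fakeResults = List.of("s1", "s2");
        for (String row : fakeResults)
            action.accept(row);
        // ... and release it here, before control returns to the caller
    }

    static List<String> run() {
        List<String> seen = new ArrayList<>();
        execSelect("SELECT * { ?s ?p ?o }", seen::add);
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [s1, s2]
    }
}
```

Contrast with execSelect returning a ResultSet: there the caller can let a live, resource-holding object escape the scope it was opened in.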

	Andy

On 04/08/15 16:14, ajs6f@virginia.edu wrote:
> Is this a little bit like Sesame 4's new Repository helper type? Not totally the same thing, but similar in that it's bringing a lot of convenience together around the notion of dataset?
>
> http://rdf4j.org/doc/4/programming.docbook?view#Stream_based_querying_and_transaction_handling
>
> ---
> A. Soroka
> The University of Virginia Library


Re: RDFConnection

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.
Is this a little bit like Sesame 4's new Repository helper type? Not totally the same thing, but similar in that it's bringing a lot of convenience together around the notion of dataset?

http://rdf4j.org/doc/4/programming.docbook?view#Stream_based_querying_and_transaction_handling

---
A. Soroka
The University of Virginia Library



Re: RDFConnection

Posted by Andy Seaborne <an...@apache.org>.
jena-client has a number of additional classes that overlap with 
existing ones: QueryStatement and UpdateStatement, for example, have the 
operations of QueryExecution. I had assumed this was to isolate the code 
in one package to some degree.  What are they trying to capture?

 > One aspect of your code I do not like is the transaction handling.
 > It is effectively always in "auto-commit" mode

That's an implementation detail for now merely because

 >> (multi-operation transactions ... later!)

Adding multi-operation transactions needs support from the server.  The 
tricky bit for Fuseki is not the protocol, it's handling a transaction 
across different request threads (esp. multiple concurrent reads).

	Andy

On 04/08/15 22:26, Stephen Allen wrote:
> Andy,
>
> I like the idea.  I've been using the jena-client [1] code to do very
> similar operations for quite a while now.  It actually is pretty feature
> complete at this point (documentation at [2]), and I would like to merge it
> into the official release (it's already up to date against 3.0.0).  I also
> used ARQ's classes instead of trying to extract everything into a public
> API.
>
> I think jena-client provides all of the features (and a few more) that
> you've added except for one implementation detail: GSP support.  Instead,
> when you add or remove DatasetGraphs or Models, it translates that into
> INSERT/DELETE DATA update queries (a fetch command could be implemented by
> building a CONSTRUCT query).  I actually tried to avoid adding GSP support
> for two reasons: 1) it complicates the usage, instead of just a query and
> update endpoint, you also need a GSP endpoint.  And 2) GSP is a subset of
> the query+update functionality.
>
> To my knowledge, the only argument for using GSP instead of just
> query+update would be performance/scalability.  Although, when I have
> encountered those issues, I've attempted to fix the problem in query+update
> instead (i.e. adding streaming support for update).  However, parsing large
> SPARQL INSERT DATA operations is still slower than parsing NT (not to
> mention rdf/thrift).  There are potential solutions for that (a
> sparql/thrift implementation, even if it only did INSERT/DELETE DATA as
> binary and left queries as string blobs), but obviously that doesn't exist
> yet.  Additionally, 3rd party remote stores such as Sesame do not have
> streaming SPARQL update support, and likely won't in the foreseeable future
> as it would be a big undertaking (Sesame uses JavaCC+JJTree to build an AST
> of the update query in memory).  But why limit ourselves based on their
> implementation?
>
> One of the motivating features of jena-client was the ability to perform
> large streaming updates (not just inserts/deletes) to a remote store.  This
> made up somewhat for the lack of remote transactions.  But maybe that isn't
> too great of an argument, when we could just go ahead and implement remote
> transaction support (here is a proposal I haven't worked on in over a year
> [3]).  We're behind Sesame here, they've had remote transactions for quite
> a while now.
>
> One aspect of your code I do not like is the transaction handling.  It is
> effectively always in "auto-commit" mode.  Jena-client's is more like JDBC
> in that the transaction operations are exposed on the Connection object.
> If the user chooses not to use the transaction mechanism then it will
> default to using "auto-commit", but the user can always control it
> explicitly, which is important.  That made it made fairly straightforward
> to build a Spring Transaction Manager [4] class for my company's project.
> It is pretty nice to use Spring annotations to perform transactions.  I
> will release that as a GitHub project if/when jena-client is released.
>
> Maybe we can use jena-client as a base to work from?  If we feel we want to
> add the separate GSP operations, then I think the extension point would be
> to add a new GSP interface similar to Updater [5] (but lacking the generic
> update query functionality).
>
> I can start working on moving it from SVN's experimental branch to the Git
> master tomorrow (I also have to update the documentation to use a lambda
> instead of an anonymous inner class in the example).
>
> -Stephen
>
> [1] https://svn.apache.org/repos/asf/jena/Experimental/jena-client/
> [2]
> https://svn.apache.org/repos/asf/jena/Experimental/jena-client/jena-client.mdtext
> [3] http://people.apache.org/~sallen/sparql11-transaction/
> [4]
> http://docs.spring.io/spring/docs/current/spring-framework-reference/html/transaction.html
> [5]
> https://svn.apache.org/repos/asf/jena/Experimental/jena-client/src/main/java/org/apache/jena/client/Updater.java

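Stephen's point about translating graph additions into updates can be sketched as follows. This is an illustrative simplification of the jena-client approach, not its actual code: real code must serialize nodes correctly, handle blank nodes, and stream large inputs rather than build one big string.

```java
import java.util.List;

// Sketch: turn "add these triples to graph g" into a SPARQL INSERT DATA
// update string, instead of sending the triples via GSP.
public class InsertDataDemo {
    static String toInsertData(String graphUri, List<String> triples) {
        StringBuilder sb = new StringBuilder();
        sb.append("INSERT DATA { GRAPH <").append(graphUri).append("> {\n");
        for (String t : triples)
            sb.append("  ").append(t).append(" .\n");
        sb.append("} }");
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(toInsertData("http://example/g",
                List.of("<http://example/s> <http://example/p> <http://example/o>")));
    }
}
```

This keeps the client to a single endpoint (update), at the cost Stephen notes: parsing a large INSERT DATA server-side is slower than parsing NT sent over GSP.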

Re: RDFConnection

Posted by Andy Seaborne <an...@apache.org>.
Here is a summary/draft to try to pull the discussions together:

There would be three main interfaces: one application-facing and two for 
the two SPARQL protocols.

(all names provisional!)

== RDFConnection

* The application facing API

*  RDFConnectionFactory (name?) to make the things.

* It builds on the two SPARQL protocols.

* Autocommit provided (client-side)

* No mention of QueryExecution

Results are processed in a style that lets RDFConnection manage the 
result set.  Probably also one operation to execute and copy the results, 
because this is a recurring support area.  Trying to get smart and pass 
around a stream is error-prone.

* Composed of SPARQLProtocolConnection (query+update) and 
SPARQLGraphStoreProtocol

* It is easier to add operations than to remove them, especially from an 
application-facing API like RDFConnection, so start cautiously. There 
could be useful compound operations, or ones applicable only sometimes, 
but for now, roughly a 1-1 match to each SPARQL operation.

= SPARQLProtocol
* The operations of Query, Update

* Explicit transactions only (client-side)

= SPARQLGraphStoreProtocol

* DatasetAccessor with renaming to make it clear how the operations 
refer to HTTP operations - we might as well call them gspGET, gspPOST, 
gspPUT, gspDELETE or something like that, and have RDFConnection use 
task-focused names (e.g. loadFile, addModel, ...).

* Deprecate DatasetAccessor (and DatasetGraphAccessor?)

* Explicit transactions only (client-side)

== Notes/Questions:

* loadFile can be done two ways - GSP and INSERT DATA (GSP is more 
efficient, but if no GSP handler is available, it can switch to SPARQL 
Protocol means).
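That switch could look roughly like the sketch below. All names here (Endpoint, gspPost, loadFile) are hypothetical stand-ins for illustration, not the proposed API.

```java
// Sketch of a GSP-first loadFile with an INSERT DATA fallback when the
// server offers no Graph Store Protocol endpoint.
public class LoadFileDemo {
    interface Endpoint {
        boolean hasGsp();
        void gspPost(String graph, String file); // stream the file as-is
        void update(String sparqlUpdate);        // SPARQL Protocol update
    }

    static String loadFile(Endpoint ep, String graph, String file) {
        if (ep.hasGsp()) {
            ep.gspPost(graph, file);             // efficient path: no client parse
            return "gsp";
        }
        // fallback: parse the file client-side and wrap it in INSERT DATA
        ep.update("INSERT DATA { GRAPH <" + graph + "> { /* parsed triples */ } }");
        return "update";
    }

    static String run(boolean gspAvailable) {
        Endpoint ep = new Endpoint() {
            public boolean hasGsp() { return gspAvailable; }
            public void gspPost(String g, String f) { }
            public void update(String u) { }
        };
        return loadFile(ep, "http://example/g", "data.ttl");
    }

    public static void main(String[] args) {
        System.out.println(run(true) + "," + run(false)); // gsp,update
    }
}
```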

* Is bulk delete a major requirement? (More a question of how much to 
design for it, not whether to have it or not - e.g. it may be only in 
SPARQLProtocol.)

* QueryExecution (or indeed the QueryStatement version) is going to be a 
problem because calling the operation only sets it up, it does not 
actually execute it.  And I would like to avoid, at least for the 
application, getting into excessive nesting of try-with-resources.  That 
is valuable for RDFConnection itself.

JDBC has statement objects for some reasons that don't apply to Jena: 
prepared statements (server side) and parameterised queries (Jena has 
different mechanisms - these may show up in RDFConnection and happen 
client-side).


* non-HTTP remote connection in the future.

Always tricky to plan for the unknown!  I think we should be aware this 
may happen but not worry too much now.  (I used to put in designs for all 
possibilities but, looking back, they never end up right and do create 
legacy baggage all too easily.)

== QueryExecution

(For SPARQLProtocol, not RDFConnection)

I'd like to refactor this rather than create a new, separate interface 
that does the same thing.  So, at a minimum, a super-interface for the 
exec* methods.

With more change, one option is to remove (via deprecation cycle) 
get/set initial binding and require providing it at the factory step.

The getDataset and getQuery operations are more of a convenience - we 
could remove them, or continue to document that they may return null.  As 
the query carries prefix information for presentation, retaining getQuery 
makes sense to me. Remove getDataset?

	Andy


Re: RDFConnection

Posted by Stephen Allen <sa...@apache.org>.
On Wed, Aug 5, 2015 at 1:14 PM, Andy Seaborne <an...@apache.org> wrote:

> On 04/08/15 22:26, Stephen Allen wrote:
>
>> To my knowledge, the only argument for using GSP instead of just
>> query+update would be performance/scalability.  Although, when I have
>> encountered those issues, I've attempted to fix the problem in
>> query+update
>> instead (i.e. adding streaming support for update).  However, parsing
>> large
>> SPARQL INSERT DATA operations is still slower than parsing NT (not to
>> mention rdf/thrift).  There are potential solutions for that (a
>> sparql/thrift implementation, even if it only did INSERT/DELETE DATA as
>> binary and left queries as string blobs), but obviously that doesn't exist
>> yet.
>>
> ...
>
>> One of the motivating features of jena-client was the ability to perform
>> large streaming updates (not just inserts/deletes) to a remote store.
>> This
>> made up somewhat for the lack of remote transactions.  But maybe that
>> isn't
>> too great of an argument, when we could just go ahead and implement remote
>> transaction support (here is a proposal I haven't worked on in over a year
>> [3]).
>>
>
> GSP is very useful for managing data in a store when combined with a union
> of named graphs as the default.  Units of the overall graph can be deleted
> (bnodes included) and replaced.
>
> It's also useful when scripting management of the data : using curl/wget
> you manage a store in simple scripts.  Being able to do that in the same
> way in Java is helpful so the user does not need two paradigms.
>

Sure, definitely makes sense.  It does seem like we can provide both
mechanisms in a straightforward way.


> Fuseki2 provides streaming updates for upload by GSP. RDFConnection has
> file upload features so the client-side does not need to parse the file,
> just pass an InputStream to HTTP layer.


Makes sense.  Jena-client doesn't do that because it has to transform it
into an update query, but obviously pays some penalties while doing that.


> RDFConnection adds the natural REST ops on datasets.
>
>
> Authentication:  we should use the HttpOp code - one reason is that it
> supports authentication for all HTTP verbs.
>
>
Agreed, jena-client uses the HttpOp code.



> Jena-client's is more like JDBC
>> in that the transaction operations are exposed on the Connection object.
>> If the user chooses not to use the transaction mechanism then it will
>> default to using "auto-commit"
>>
>
> Agreed, and in fact there is an issue here with autocommit, streaming and
> SELECT queries.  The ResultSet is passed out of the execSelect operation
> but needs to be inside the transaction.  Autocommit defeats that.
>

Yes, I tried to mitigate that with the AutoCommitQueryExecution class.  It
wraps the QueryExecution used on a local dataset and then enforces
transaction semantics between the exec*() and close() methods.  Obviously
it relies on the user to call close() (or better yet use a
try-with-resources) on the corresponding QueryStatement (they never see the
QueryExecution object directly).


>
> Which touches on the JDBC issue that drivers tend to execute and receive
> all the results before the client can start working on the answers
> (sometimes there are ways round this to be used with care).  The issue is
> badly behaved clients hogging resources in the server.
>

We could default to copying results into memory if we wanted, but provide
an override to disable that.
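That copying option amounts to materializing the rows while the read transaction is still open, so the caller can use them afterwards. A minimal sketch with stand-in transaction and result types (not the Jena API):

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch: copy result rows into memory inside the read transaction so
// nothing resource-holding escapes it.
public class CopyResultsDemo {
    static final List<String> rows = List.of("r1", "r2");
    static boolean inTxn = false; // stand-in for real transaction state

    static Iterator<String> execSelect() {
        if (!inTxn)
            throw new IllegalStateException("results are only valid inside a transaction");
        return rows.iterator();
    }

    static List<String> selectCopy() {
        inTxn = true;                                  // begin(READ)
        try {
            List<String> copy = new ArrayList<>();
            execSelect().forEachRemaining(copy::add);  // materialize inside the txn
            return copy;                               // safe to use after the txn ends
        } finally {
            inTxn = false;                             // end/commit
        }
    }

    public static void main(String[] args) {
        System.out.println(selectCopy()); // [r1, r2]
    }
}
```

The override to disable copying would hand back the live iterator instead, trading safety for streaming.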


>
> Some possibilities:
> 0/ Don't support autocommit.  In the local case that is quite natural;
> less so for the remote case because HTTP is not about state.
>
> (I looked more at the remote case - e.g. the local connection
> implementation isolates results to get the same semantics as remote.)
>
> 1/ Autocommit cases receive the results completely.  Some idioms don't
> work in autocommit mode.
>
> 2/ An operation to make sure the QueryExecution is inside a transaction
> and also closed.
>
> RDFConnection
> public default void querySelect(Query query,
>                                 Consumer<QuerySolution> rowAction) {
>     Txn.executeRead(this, ()->{
>         try ( QueryExecution qExec = query(query) ) {
>             qExec.execSelect().forEachRemaining(rowAction);
>         }
>     } ) ;
> }
>

Although I think that using a Consumer like you do in 2) is a great way of
doing things (this is exclusively how we allow queries in our app), perhaps
that functionality should be built as a utility class on top of lower-level
functionality that does let you shoot yourself in the foot if you like.
Then strongly encourage users to do it the safe way.
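That layering can be sketched in plain Java: a low-level method hands out a closeable execution the caller can misuse, and the safe Consumer-based form is a small utility on top. All types below are stand-ins for the Jena ones.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

// Sketch: safe Consumer-based select built on a lower-level,
// resource-exposing API.
public class SafeSelectDemo {
    static final List<String> events = new ArrayList<>();

    // low-level: caller owns the resource (and can shoot itself in the foot)
    static class Execution implements AutoCloseable {
        List<String> rows() { return List.of("a", "b"); }
        @Override public void close() { events.add("closed"); }
    }

    static Execution query(String sparql) {
        events.add("opened");
        return new Execution();
    }

    // utility on top: transaction boundaries and close() handled for the caller
    static void querySelect(String sparql, Consumer<String> rowAction) {
        events.add("begin");                    // stand-in for begin(READ)
        try (Execution exec = query(sparql)) {
            exec.rows().forEach(rowAction);
        } finally {
            events.add("end");                  // stand-in for end/commit
        }
    }

    static List<String> run() {
        events.clear();
        querySelect("SELECT * { ?s ?p ?o }", row -> events.add("row:" + row));
        return events;
    }

    public static void main(String[] args) {
        System.out.println(run()); // [begin, opened, row:a, row:b, closed, end]
    }
}
```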



>
> By the way - I added explicit transaction support and some example usage.
>
> Maybe we can use jena-client as a base to work from?  If we feel we want to
>> add the separate GSP operations, then I think the extension point would be
>> to add a new GSP interface similar to Updater [5] (but lacking the generic
>> update query functionality).
>>
>
> I have no problem with jena-client as the starting point, I want to
> understand its design first.
>
> I'm not seeing what the separate interfaces and *Statement gives the
> application - maybe I'm missing something here - it does seem to make it
> more complicated compared to just performing the operation.  For
> *Statement, it's still limited in scope to the connection but can be passed
> out.
>
>
The reason for all the interfaces was to ease different implementations.
There is already the local dataset and remote cases, and as you mention
below, possibly some other non-HTTP case.  Additionally perhaps 3rd parties
might want to do a different implementation.

The main reason for the new QueryStatement and UpdateStatement classes was
because QueryExecution (and similarly UpdateProcessor) has methods that
seemed inappropriate:
  * setInitialBinding(QuerySolution) - This is not SPARQL, and further only
works with local datasets
  * getDataset() - Mostly doesn't make sense for remote datasets (because
of blank nodes).  Also the remote case would have to fetch everything
eagerly.
  * getContext() - Only for local datasets
  * getQuery() - This method could make sense to add to QueryStatement.
Implies parsing the query client-side

But these are relatively minor reasons.



> Please remove the Sesame comments in javadoc and documentation.  There's
> no need to put comments about another community on implementation choices
> that can change in javadoc and documentation.  If you want to write up the
> reasons then have a blog item somewhere, and hence making it more time
> specific.
>
>
Yep, you are correct, was trying to be overly helpful.  Removed.



> We might want to consider a non-HTTP remote connection; at least design
> for the possibility.   My motivation was initially more around working with
> other people's published data (i.e. a long way away, not same data centre).
>
>
Yeah, that would be a good idea.  The HTTP protocol does impose some
annoying limitations, especially with transactions and duplex communication.

-Stephen

Re: RDFConnection

Posted by Andy Seaborne <an...@apache.org>.
On 04/08/15 22:26, Stephen Allen wrote:
> To my knowledge, the only argument for using GSP instead of just
> query+update would be performance/scalability.  Although, when I have
> encountered those issues, I've attempted to fix the problem in query+update
> instead (i.e. adding streaming support for update).  However, parsing large
> SPARQL INSERT DATA operations is still slower than parsing NT (not to
> mention rdf/thrift).  There are potential solutions for that (a
> sparql/thrift implementation, even if it only did INSERT/DELETE DATA as
> binary and left queries as string blobs), but obviously that doesn't exist
> yet.
...
> One of the motivating features of jena-client was the ability to perform
> large streaming updates (not just inserts/deletes) to a remote store.  This
> made up somewhat for the lack of remote transactions.  But maybe that isn't
> too great of an argument, when we could just go ahead and implement remote
> transaction support (here is a proposal I haven't worked on in over a year
> [3]).

GSP is very useful for managing data in a store when combined with a 
union of named graphs as the default.  Units of the overall graph can be 
deleted (bnodes included) and replaced.

It's also useful when scripting management of the data: using curl/wget 
you can manage a store with simple scripts.  Being able to do that the 
same way in Java is helpful, so the user does not need two paradigms.

Fuseki2 provides streaming updates for upload by GSP. RDFConnection has 
file upload features so the client side does not need to parse the file; 
it just passes an InputStream to the HTTP layer.

RDFConnection adds the natural REST ops on datasets.
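As a rough illustration of those "natural REST ops", here is a toy in-memory sketch of how the GSP verbs map onto per-graph operations.  The method names (fetch/put/post/delete) echo the proposed surface, but the types are plain-Java stand-ins, not Jena graphs.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Toy in-memory sketch of the GSP verb-to-operation mapping; triples are
// plain strings here, not Jena objects.
public class GspSketch {
    private final Map<String, Set<String>> graphs = new HashMap<>();

    // GET    ?graph=name  : retrieve a graph
    Set<String> fetch(String graph) {
        return graphs.getOrDefault(graph, Set.of());
    }
    // PUT    ?graph=name  : replace a graph wholesale
    void put(String graph, Set<String> triples) {
        graphs.put(graph, new HashSet<>(triples));
    }
    // POST   ?graph=name  : merge triples into a graph
    void post(String graph, Set<String> triples) {
        graphs.computeIfAbsent(graph, g -> new HashSet<>()).addAll(triples);
    }
    // DELETE ?graph=name  : remove a graph entirely (bnodes included)
    void delete(String graph) {
        graphs.remove(graph);
    }

    public static void main(String[] args) {
        GspSketch store = new GspSketch();
        store.put("http://example/g", Set.of(":s :p :o"));
        System.out.println(store.fetch("http://example/g"));
    }
}
```

PUT-as-replace and DELETE-of-a-whole-unit are exactly the operations that are awkward to express as query+update when bnodes are involved.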


Authentication:  we should use the HttpOp code - one reason is that it 
supports authentication for all HTTP verbs.

> Jena-client's is more like JDBC
> in that the transaction operations are exposed on the Connection object.
> If the user chooses not to use the transaction mechanism then it will
> default to using "auto-commit"

Agreed, and in fact there is an issue here with autocommit, streaming, 
and SELECT queries.  The ResultSet is passed out of the execSelect 
operation but needs to be consumed inside the transaction.  Autocommit 
defeats that.

Which touches on the JDBC issue that drivers tend to execute and receive 
all the results before the client can start working on the answers 
(sometimes there are ways round this to be used with care).  The issue 
is badly behaved clients hogging resources in the server.


Some possibilities:
0/ Don't support autocommit.  In the local case that is quite natural; 
less so for the remote case because HTTP is stateless.

(I looked more at the remote case - e.g. the local connection 
implementation isolates results to get the same semantics as remote.)

1/ Autocommit cases receive the results completely.  Some idioms don't 
work in autocommit mode.

2/ An operation to make sure the QueryExecution is inside a transaction 
and also closed.

In RDFConnection:

public default void querySelect(Query query,
                                Consumer<QuerySolution> rowAction) {
    Txn.executeRead(this, () -> {
        try ( QueryExecution qExec = query(query) ) {
            qExec.execSelect().forEachRemaining(rowAction);
        }
    });
}
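A toy stand-alone version of the same callback pattern shows why it solves the autocommit problem: every row is delivered while the read transaction is still open, and the caller never holds a live iterator outside it.  Plain-Java stand-ins only; rows are strings rather than QuerySolutions, and executeRead is a stand-in for Txn.executeRead.

```java
import java.util.List;
import java.util.function.Consumer;

// Toy demonstration of the callback idiom above; not Jena classes.
public class CallbackSelect {
    private boolean inTxn = false;
    private final List<String> rows = List.of("row1", "row2");

    // Stand-in for Txn.executeRead: begin, run the action, always end.
    void executeRead(Runnable action) {
        inTxn = true;
        try { action.run(); }
        finally { inTxn = false; }
    }

    // Equivalent of querySelect(Query, Consumer<QuerySolution>): the row
    // callback runs inside the transaction, which is closed on return.
    void querySelect(Consumer<String> rowAction) {
        executeRead(() -> {
            for (String row : rows) {
                if (!inTxn)
                    throw new IllegalStateException("row seen outside txn");
                rowAction.accept(row);
            }
        });
    }

    public static void main(String[] args) {
        new CallbackSelect().querySelect(System.out::println);
    }
}
```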

By the way - I added explicit transaction support and some example usage.

> Maybe we can use jena-client as a base to work from?  If we feel we want to
> add the separate GSP operations, then I think the extension point would be
> to add a new GSP interface similar to Updater [5] (but lacking the generic
> update query functionality).

I have no problem with jena-client as the starting point, I want to 
understand its design first.

I'm not seeing what the separate interfaces and *Statement give the 
application - maybe I'm missing something here - they do seem to make it 
more complicated compared to just performing the operation.  For 
*Statement, it's still limited in scope to the connection but can be 
passed out.

Please remove the Sesame comments in javadoc and documentation.  There's 
no need for javadoc and documentation to comment on another community's 
implementation choices, which can change.  If you want to write up the 
reasons, put them in a blog item somewhere, making them more 
time-specific.

We might want to consider a non-HTTP remote connection; at least design 
for the possibility.   My motivation was initially more around working 
with other people's published data (i.e. a long way away, not same data 
centre).

	Andy


Re: RDFConnection

Posted by Stephen Allen <sa...@apache.org>.
Andy,

I like the idea.  I've been using the jena-client [1] code to do very
similar operations for quite a while now.  It actually is pretty feature
complete at this point (documentation at [2]), and I would like to merge it
into the official release (it's already up to date against 3.0.0).  I also
used ARQ's classes instead of trying to extract everything into a public
API.

I think jena-client provides all of the features (and a few more) that
you've added except for one implementation detail: GSP support.  Instead,
when you add or remove DatasetGraphs or Models, it translates that into
INSERT/DELETE DATA update queries (a fetch command could be implemented by
building a CONSTRUCT query).  I actually tried to avoid adding GSP support
for two reasons: 1) it complicates the usage, instead of just a query and
update endpoint, you also need a GSP endpoint.  And 2) GSP is a subset of
the query+update functionality.
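As a rough illustration of that translation, here is a toy, string-based sketch of turning "add these triples to graph G" into a SPARQL INSERT DATA update.  The real jena-client works on Jena model/triple objects and streams the update rather than concatenating strings; the triples here are assumed to be already-serialized N-Triples terms, and no escaping or validation is attempted.

```java
import java.util.List;
import java.util.StringJoiner;

// Toy sketch: building an INSERT DATA update string for a graph-add
// operation.  Triples are pre-serialized strings, not Jena objects.
public class InsertDataSketch {

    static String toInsertData(String graphUri, List<String> triples) {
        StringJoiner body = new StringJoiner(" .\n    ", "    ", " .");
        triples.forEach(body::add);
        return "INSERT DATA {\n  GRAPH <" + graphUri + "> {\n"
                + body + "\n  }\n}";
    }

    public static void main(String[] args) {
        System.out.println(toInsertData("http://example/g",
                List.of("<http://example/s> <http://example/p> <http://example/o>")));
    }
}
```

A fetch would go the other way: build a CONSTRUCT (or a SELECT over GRAPH) for the named graph instead of a GSP GET.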

To my knowledge, the only argument for using GSP instead of just
query+update would be performance/scalability.  Although, when I have
encountered those issues, I've attempted to fix the problem in query+update
instead (i.e. adding streaming support for update).  However, parsing large
SPARQL INSERT DATA operations is still slower than parsing NT (not to
mention rdf/thrift).  There are potential solutions for that (a
sparql/thrift implementation, even if it only did INSERT/DELETE DATA as
binary and left queries as string blobs), but obviously that doesn't exist
yet.  Additionally, 3rd party remote stores such as Sesame do not have
streaming SPARQL update support, and likely won't in the foreseeable future
as it would be a big undertaking (Sesame uses JavaCC+JJTree to build an AST
of the update query in memory).  But why limit ourselves based on their
implementation?

One of the motivating features of jena-client was the ability to perform
large streaming updates (not just inserts/deletes) to a remote store.  This
made up somewhat for the lack of remote transactions.  But maybe that isn't
too great of an argument, when we could just go ahead and implement remote
transaction support (here is a proposal I haven't worked on in over a year
[3]).  We're behind Sesame here, they've had remote transactions for quite
a while now.

One aspect of your code I do not like is the transaction handling.  It is
effectively always in "auto-commit" mode.  Jena-client's is more like JDBC
in that the transaction operations are exposed on the Connection object.
If the user chooses not to use the transaction mechanism then it will
default to using "auto-commit", but the user can always control it
explicitly, which is important.  That made it fairly straightforward
to build a Spring Transaction Manager [4] class for my company's project.
It is pretty nice to use Spring annotations to perform transactions.  I
will release that as a GitHub project if/when jena-client is released.

Maybe we can use jena-client as a base to work from?  If we feel we want to
add the separate GSP operations, then I think the extension point would be
to add a new GSP interface similar to Updater [5] (but lacking the generic
update query functionality).

I can start working on moving it from SVN's experimental branch to the Git
master tomorrow (I also have to update the documentation to use a lambda
instead of an anonymous inner class in the example).

-Stephen

[1] https://svn.apache.org/repos/asf/jena/Experimental/jena-client/
[2]
https://svn.apache.org/repos/asf/jena/Experimental/jena-client/jena-client.mdtext
[3] http://people.apache.org/~sallen/sparql11-transaction/
[4]
http://docs.spring.io/spring/docs/current/spring-framework-reference/html/transaction.html
[5]
https://svn.apache.org/repos/asf/jena/Experimental/jena-client/src/main/java/org/apache/jena/client/Updater.java


On Sun, Aug 2, 2015 at 3:05 PM, Andy Seaborne <an...@apache.org> wrote:

> Stephen, all,
>
> Recently on users@ there was a question about the s-* in java. That got
> me thinking about an interface to pull together all SPARQL operations into
> one application-facing place.  We have jena-jdbc, and jena-client already -
> this is my sketch take.
>
> [1] RDFConnection
>
> Currently, it's a sketch-for-discussion; it's a bit DatasetAccessor-like +
> SPARQL query + SPARQL Update.  And some whole-dataset-REST-ish operations
> (that Fuseki happens to support).  It's a chance to redo things a bit.
>
> RDFConnection uses the existing SPARQL+RDF classes and abstractions in
> ARQ, not strings, [*]  rather than putting all app-visible classes in one
> package.
>
> Adding an equivalent of DatabaseClient to represent one place would be
> good - and add the admin operations, for Fuseki at least.  Also, a
> streaming load possibility.
>
> Comments?
> Specific use cases?
>
>         Andy
>
> (multi-operation transactions ... later!)
>
> [*] You can use strings as well - that's the way to get arbitrary
> non-standard extensions through.
>
> [1]
> https://github.com/afs/AFS-Dev/blob/master/src/main/java/projects/rdfconnection/RDFConnection.java
>