Posted to dev@jena.apache.org by Stephen Allen <sa...@apache.org> on 2013/01/31 05:33:23 UTC

Streaming SPARQL Updates

All,

I've finally completed the work I was doing on streaming SPARQL Updates.  Yay!

There should be no user visible changes from this update, but there
will be some changes for 3rd party storage engines.  Specifically they
need to implement a couple of new methods in their UpdateEngineFactory
implementation.  See the changes to
jena-tdb/com.hp.hpl.jena.tdb.modify.UpdateEngineTDB in revision
1440841 for what probably needs to be done for your engine.
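For orientation, a rough sketch of the shape involved - using the factory
signatures discussed later in this thread, so the exact parameter lists may
differ from the revision; MyGraphStore and UpdateEngineMyStore are
hypothetical names, and UpdateEngineTDB is the authoritative example:

    // Hedged sketch of a 3rd-party UpdateEngineFactory after this change.
    // Signatures evolved during this thread; check UpdateEngineTDB (r1440841).
    public class UpdateEngineFactoryMyStore implements UpdateEngineFactory
    {
        @Override
        public boolean accept(GraphStore graphStore, Context context)
        {
            // Claim only our own storage implementation
            return graphStore instanceof MyGraphStore;
        }

        @Override
        public UpdateEngine create(GraphStore graphStore, Binding inputBinding, Context context)
        {
            // Whole-request (non-streaming) execution path
            return new UpdateEngineMyStore((MyGraphStore)graphStore, inputBinding, context);
        }

        @Override
        public UpdateEngineStreaming createStreaming(GraphStore graphStore, Binding inputBinding, Context context)
        {
            // Streaming path; an engine that cannot stream can buffer here instead
            return new UpdateEngineMyStore((MyGraphStore)graphStore, inputBinding, context);
        }
    }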

What this change enables:
  1) Update requests that can contain an unlimited number of queries
  2) Update queries that can contain an unlimited number of quads
inside of INSERT DATA and DELETE DATA queries
  3) If the underlying storage engine is transactional, then Fuseki
can now stream the updates directly to the engine, theoretically
allowing an unlimited number of quads to be inserted in a single
query*

To exploit these new capabilities to the fullest, you probably need to
use jena-client, which is able to create streaming update requests both
for local datasets and remote HTTP endpoints.
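As a rough illustration of what a streaming request looks like through the
core API (a sketch, not jena-client itself - it assumes a creation method
along the lines of UpdateExecutionFactory.createStreaming(...) and the usual
Sink contract of send()/flush()/close() on the UpdateSink):

    // Sketch only: push update operations into the dataset one at a time,
    // without materialising the whole UpdateRequest in memory.
    UpdateProcessorStreaming proc = UpdateExecutionFactory.createStreaming(graphStore);
    proc.startRequest();                     // begin the request
    UpdateSink sink = proc.getUpdateSink();
    while (moreUpdates())                    // moreUpdates()/nextUpdate() are
        sink.send(nextUpdate());             // hypothetical producers
    sink.close();
    proc.finishRequest();                    // finish/commit the request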

Please let me know if you notice any problems.  As this is a pretty
big feature change, I would appreciate test reports if you have time.

-Stephen


* I believe TDB still buffers uncommitted changes in memory, so there
is still a limit here (but it is a lot better than before, which
stored multiple copies of the update request in memory).  You can test
this unlimited feature by using GraphStoreNull, which will just throw
away new quads (use the assembler attached to JENA-330 to stand up a
Fuseki instance using GraphStoreNull).

Re: Streaming SPARQL Updates

Posted by Andy Seaborne <an...@apache.org>.
On 05/02/13 21:04, Stephen Allen wrote:
> So UpdateAction, UpdateExecutionFactory, UpdateProcessor,
> UpdateProcessorStreaming are part of the user API and should not have
> changed much.

Are there notes to describe the changes?

Time to start thinking about 2.10.0!

 > UpdateAction was a little screwy,

It's also a set of convenience operations - if the app wants detailed control, 
then it should not assume convenience code is the only way to interact 
with the system.


	Andy


Re: Streaming SPARQL Updates

Posted by Andy Seaborne <an...@apache.org>.
On 06/02/13 19:09, Stephen Allen wrote:
> On Wed, Feb 6, 2013 at 12:40 PM, Andy Seaborne <an...@apache.org> wrote:
>> (tried to send an earlier version of this - also I put in some changes
>> then, on reflection, didn't like them and already reversed these changes)
>>
>>
>> UpdateEngine.getUpdateSink
>>
>> I'm confused here - is it a getter or a creator?
>>
>> The comments that were in UpdateEngineMain say one thing, and the impl
>> creates a new one on each call.
>>
>> I've made it getter-style, returning a fixed UpdateSink for the single SPARQL
>> Update request.
>>
>
> I guess it should probably be a getter as you have changed it.  I'm
> not too clear on it myself.

I've just done the same to UpdateEngineNonStreaming -- my use case is a 
single UpdateProcessorStreaming that is passed between software elements 
and accumulates Updates from different places.  Each element calls 
getUpdateSink to get the sink - so it had better be the same one.
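In other words (illustrative only):

    // proc is the shared UpdateProcessorStreaming; both components must get
    // the identical sink back, hence getter (not creator) semantics.
    UpdateSink a = proc.getUpdateSink();   // called from component A
    UpdateSink b = proc.getUpdateSink();   // called from component B
    assert a == b;                         // both accumulate into one request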

> Was going to ask why you made those changes, but then saw you reverted them :)

And I thought I'd sorted out the interfaces quite nicely until, 
afterwards, I realised I hadn't at all. :-(  Too much DNS wrestling 
causing a distraction.

> I've also been thinking about the 2.10.0 release.  What parts do you
> think I need to document more?  The big change for users is the
> removal of the initial binding capability.  For implementers it is the
> change to UpdateEngine/UpdateEngineFactory, but I think it should be
> fairly straightforward if they look at UpdateEngineMain or
> UpdateEngineNonStreaming.

IMO, a paragraph to give information on the new feature of streaming 
updates, then an overview to say that engine implementers need to be 
aware so that they don't first encounter this by compile errors or 
runtime errors, then have to look in the code.  Entries in ARQ release 
notes are necessary but not sufficiently visible.

An overview of the reasons for 2.10 would go in the notes that go in the 
announcement.

For initial binding, it's a different audience, but again we need top-level 
text that points out the change.

Ditto internal graph simplification, and RIOT reader changes.

	Andy


Re: Streaming SPARQL Updates

Posted by Stephen Allen <sa...@apache.org>.
On Wed, Feb 6, 2013 at 12:40 PM, Andy Seaborne <an...@apache.org> wrote:
> (tried to send an earlier version of this - also I put in some changes
> then, on reflection, didn't like them and already reversed these changes)
>
>
> UpdateEngine.getUpdateSink
>
> I'm confused here - is it a getter or a creator?
>
> The comments that were in UpdateEngineMain say one thing, and the impl
> creates a new one on each call.
>
> I've made it getter-style, returning a fixed UpdateSink for the single SPARQL
> Update request.
>

I guess it should probably be a getter as you have changed it.  I'm
not too clear on it myself.

Was going to ask why you made those changes, but then saw you reverted them :)

I've also been thinking about the 2.10.0 release.  What parts do you
think I need to document more?  The big change for users is the
removal of the initial binding capability.  For implementers it is the
change to UpdateEngine/UpdateEngineFactory, but I think it should be
fairly straightforward if they look at UpdateEngineMain or
UpdateEngineNonStreaming.

-Stephen

Re: Streaming SPARQL Updates

Posted by Andy Seaborne <an...@apache.org>.
(tried to send an earlier version of this - also I put in some changes 
then, on reflection, didn't like them and already reversed these changes)


UpdateEngine.getUpdateSink

I'm confused here - is it a getter or a creator?

The comments that were in UpdateEngineMain say one thing, and the impl 
creates a new one on each call.

I've made it getter-style, returning a fixed UpdateSink for the single 
SPARQL Update request.

	Andy



Re: Streaming SPARQL Updates

Posted by Stephen Allen <sa...@apache.org>.
So UpdateAction, UpdateExecutionFactory, UpdateProcessor,
UpdateProcessorStreaming are part of the user API and should not have
changed much.  They are the way to interact with the update system.
For storage engine implementers, what has changed is that UpdateEngine
is now stream based.  If, as a storage provider, you for some reason
cannot process updates in a streaming fashion, then you may simply
buffer up all of the update operations and then do something with
them.  I've added an example of how to do that with the
UpdateEngineNonStreaming class.
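The buffering idea amounts to something like the sketch below - the real
reference is UpdateEngineNonStreaming itself; the collector here is
hand-rolled, and executeBufferedRequest() is a hypothetical stand-in for the
whole-request execution path:

    // Sketch of the non-streaming fallback: accumulate every operation,
    // then execute the lot once the request is complete.
    final List<Update> buffered = new ArrayList<Update>();
    Sink<Update> collector = new Sink<Update>()
    {
        @Override public void send(Update update) { buffered.add(update); }
        @Override public void flush() {}
        @Override public void close()
        {
            // The parser has delivered everything: run it as one request
            executeBufferedRequest(buffered);   // hypothetical helper
        }
    };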

UpdateAction was a little screwy; it was basically doing the work that
UpdateProcessorBase should have been doing.  I've simplified that bit
a bunch.

Let me know if you still have questions about how to implement an
UpdateEngine for your purposes.  I will also be going through and
pulling out the initial binding as recommended by Andy, so there will
still be more changes to UpdateEngine and UpdateEngineFactory.

-Stephen

On Tue, Feb 5, 2013 at 12:33 PM, Rob Vesse <rv...@yarcdata.com> wrote:
> Actually part of my email was still valid: UpdateAction.execute() would
> not have fallen back to non-streaming; it should now do this as of my latest
> commit
>
> Rob
>
>
> On 2/5/13 9:19 AM, "Rob Vesse" <rv...@cray.com> wrote:
>
>>Ignore my last email, having problems with Eclipse auto-complete and
>>workspace versioning sync issues, also my jet lagged brain having flown
>>back into the US only yesterday ;)
>>
>>Garghhh
>>
>>Rob
>>
>>
>>On 2/5/13 9:16 AM, "Rob Vesse" <rv...@cray.com> wrote:
>>
>>>Ok I am slightly stumped by the latest changes
>>>
>>>UpdateAction.execute() is broken in that it only tries to do streaming
>>>updates and can't fall back to non-streaming.
>>>
>>>UpdateProcessor no longer has any execute() method so is the assumption
>>>simply that the update happens when startRequest() is called or when
>>>finishRequest() is called????
>>>
>>>We will only ever be using non-streaming updates and fundamentally cannot
>>>change to streaming updates due to architectural constraints, so I don't want
>>>to be stuck with a crippled non-streaming API
>>>
>>>Rob
>>>
>>>
>>>On 2/5/13 3:09 PM, "Andy Seaborne" <an...@apache.org> wrote:
>>>
>>>>> I have unified the two update paths into a single path.  I also
>>>>> removed the ability to pass in an UpdateRequest into the
>>>>> UpdateEngineFactory.  In fact, I've eliminated all the places you
>>>>> could pass one in.  Since with the streaming capability, we won't be
>>>>> able to have one.  This required one change to the public API,
>>>>> GraphStore.startRequest() and .finishRequest() no longer take an
>>>>> UpdateRequest as an argument.  This shouldn't be too much burden for
>>>>> end users and implementors to adapt to.  Also there is no
>>>>> UpdateRequest available via the execution Context object (as it won't
>>>>> be known ahead of time).
>>>>
>>>>Great!
>>>>
>>>>We'll need some documentation to explain the migration, hopefully more
>>>>for the pre-release cycle than the release as that's when the extenders
>>>>should be aware.
>>>>
>>>>I have fixed up the tests to work by removing the initial binding update
>>>>tests.
>>>>
>>>>Currently, 3 tests, + support method, are commented out, with a note to
>>>>say the commented code can be removed completely post 2.10.
>>>>
>>>>I have also cleaned up javadoc warnings (no such variable; class not
>>>>imported).
>>>>
>>>>> Things left to do:
>>>>>    1) Eliminate *or* deprecate the ability to pass in an initial
>>>>> binding for update requests
>>>>
>>>>IMO Remove.
>>>>
>>>>If the interface is changed at all, remove, to avoid two changes or,
>>>>more likely, deprecation for years.
>>>>
>>>>>    2) More javadocs around the UpdateEngine for implementors
>>>>
>>>>?? Or something on the website + link in javadoc -- your call
>>>>
>>>>>    3) Change the name/operation of UpdateVisitor
>>>>
>>>>Could you explain that please?
>>>>
>>>>>
>>>>> -Stephen
>>>>>
>>>>
>>>
>>
>

Re: Streaming SPARQL Updates

Posted by Rob Vesse <rv...@yarcdata.com>.
Actually part of my email was still valid: UpdateAction.execute() would
not have fallen back to non-streaming; it should now do this as of my latest
commit

Rob


On 2/5/13 9:19 AM, "Rob Vesse" <rv...@cray.com> wrote:

>Ignore my last email, having problems with Eclipse auto-complete and
>workspace versioning sync issues, also my jet lagged brain having flown
>back into the US only yesterday ;)
>
>Garghhh
>
>Rob
>
>
>On 2/5/13 9:16 AM, "Rob Vesse" <rv...@cray.com> wrote:
>
>>Ok I am slightly stumped by the latest changes
>>
>>UpdateAction.execute() is broken in that it only tries to do streaming
>>updates and can't fall back to non-streaming.
>>
>>UpdateProcessor no longer has any execute() method so is the assumption
>>simply that the update happens when startRequest() is called or when
>>finishRequest() is called????
>>
>>We will only ever be using non-streaming updates and fundamentally cannot
>>change to streaming updates due to architectural constraints, so I don't want
>>to be stuck with a crippled non-streaming API
>>
>>Rob
>>
>>
>>On 2/5/13 3:09 PM, "Andy Seaborne" <an...@apache.org> wrote:
>>
>>>> I have unified the two update paths into a single path.  I also
>>>> removed the ability to pass in an UpdateRequest into the
>>>> UpdateEngineFactory.  In fact, I've eliminated all the places you
>>>> could pass one in.  Since with the streaming capability, we won't be
>>>> able to have one.  This required one change to the public API,
>>>> GraphStore.startRequest() and .finishRequest() no longer take an
>>>> UpdateRequest as an argument.  This shouldn't be too much burden for
>>>> end users and implementors to adapt to.  Also there is no
>>>> UpdateRequest available via the execution Context object (as it won't
>>>> be known ahead of time).
>>>
>>>Great!
>>>
>>>We'll need some documentation to explain the migration, hopefully more
>>>for the pre-release cycle than the release as that's when the extenders
>>>should be aware.
>>>
>>>I have fixed up the tests to work by removing the initial binding update
>>>tests.
>>>
>>>Currently, 3 tests, + support method, are commented out, with a note to
>>>say the commented code can be removed completely post 2.10.
>>>
>>>I have also cleaned up javadoc warnings (no such variable; class not
>>>imported).
>>>
>>>> Things left to do:
>>>>    1) Eliminate *or* deprecate the ability to pass in an initial
>>>> binding for update requests
>>>
>>>IMO Remove.
>>>
>>>If the interface is changed at all, remove, to avoid two changes or,
>>>more likely, deprecation for years.
>>>
>>>>    2) More javadocs around the UpdateEngine for implementors
>>>
>>>?? Or something on the website + link in javadoc -- your call
>>>
>>>>    3) Change the name/operation of UpdateVisitor
>>>
>>>Could you explain that please?
>>>
>>>>
>>>> -Stephen
>>>>
>>>
>>
>


Re: Streaming SPARQL Updates

Posted by Rob Vesse <rv...@yarcdata.com>.
Ignore my last email, having problems with Eclipse auto-complete and
workspace versioning sync issues, also my jet lagged brain having flown
back into the US only yesterday ;)

Garghhh

Rob


On 2/5/13 9:16 AM, "Rob Vesse" <rv...@cray.com> wrote:

>Ok I am slightly stumped by the latest changes
>
>UpdateAction.execute() is broken in that it only tries to do streaming
>updates and can't fall back to non-streaming.
>
>UpdateProcessor no longer has any execute() method so is the assumption
>simply that the update happens when startRequest() is called or when
>finishRequest() is called????
>
>We will only ever be using non-streaming updates and fundamentally cannot
>change to streaming updates due to architectural constraints, so I don't want
>to be stuck with a crippled non-streaming API
>
>Rob
>
>
>On 2/5/13 3:09 PM, "Andy Seaborne" <an...@apache.org> wrote:
>
>>> I have unified the two update paths into a single path.  I also
>>> removed the ability to pass in an UpdateRequest into the
>>> UpdateEngineFactory.  In fact, I've eliminated all the places you
>>> could pass one in.  Since with the streaming capability, we won't be
>>> able to have one.  This required one change to the public API,
>>> GraphStore.startRequest() and .finishRequest() no longer take an
>>> UpdateRequest as an argument.  This shouldn't be too much burden for
>>> end users and implementors to adapt to.  Also there is no
>>> UpdateRequest available via the execution Context object (as it won't
>>> be known ahead of time).
>>
>>Great!
>>
>>We'll need some documentation to explain the migration, hopefully more
>>for the pre-release cycle than the release as that's when the extenders
>>should be aware.
>>
>>I have fixed up the tests to work by removing the initial binding update
>>tests.
>>
>>Currently, 3 tests, + support method, are commented out, with a note to
>>say the commented code can be removed completely post 2.10.
>>
>>I have also cleaned up javadoc warnings (no such variable; class not
>>imported).
>>
>>> Things left to do:
>>>    1) Eliminate *or* deprecate the ability to pass in an initial
>>> binding for update requests
>>
>>IMO Remove.
>>
>>If the interface is changed at all, remove, to avoid two changes or,
>>more likely, deprecation for years.
>>
>>>    2) More javadocs around the UpdateEngine for implementors
>>
>>?? Or something on the website + link in javadoc -- your call
>>
>>>    3) Change the name/operation of UpdateVisitor
>>
>>Could you explain that please?
>>
>>>
>>> -Stephen
>>>
>>
>


Re: Streaming SPARQL Updates

Posted by Rob Vesse <rv...@yarcdata.com>.
Ok I am slightly stumped by the latest changes

UpdateAction.execute() is broken in that it only tries to do streaming
updates and can't fall back to non-streaming.

UpdateProcessor no longer has any execute() method so is the assumption
simply that the update happens when startRequest() is called or when
finishRequest() is called????

We will only ever be using non-streaming updates and fundamentally cannot
change to streaming updates due to architectural constraints, so I don't want
to be stuck with a crippled non-streaming API

Rob


On 2/5/13 3:09 PM, "Andy Seaborne" <an...@apache.org> wrote:

>> I have unified the two update paths into a single path.  I also
>> removed the ability to pass in an UpdateRequest into the
>> UpdateEngineFactory.  In fact, I've eliminated all the places you
>> could pass one in.  Since with the streaming capability, we won't be
>> able to have one.  This required one change to the public API,
>> GraphStore.startRequest() and .finishRequest() no longer take an
>> UpdateRequest as an argument.  This shouldn't be too much burden for
>> end users and implementors to adapt to.  Also there is no
>> UpdateRequest available via the execution Context object (as it won't
>> be known ahead of time).
>
>Great!
>
>We'll need some documentation to explain the migration, hopefully more
>for the pre-release cycle than the release as that's when the extenders
>should be aware.
>
>I have fixed up the tests to work by removing the initial binding update
>tests.
>
>Currently, 3 tests, + support method, are commented out, with a note to
>say the commented code can be removed completely post 2.10.
>
>I have also cleaned up javadoc warnings (no such variable; class not
>imported).
>
>> Things left to do:
>>    1) Eliminate *or* deprecate the ability to pass in an initial
>> binding for update requests
>
>IMO Remove.
>
>If the interface is changed at all, remove, to avoid two changes or,
>more likely, deprecation for years.
>
>>    2) More javadocs around the UpdateEngine for implementors
>
>?? Or something on the website + link in javadoc -- your call
>
>>    3) Change the name/operation of UpdateVisitor
>
>Could you explain that please?
>
>>
>> -Stephen
>>
>


Re: Streaming SPARQL Updates

Posted by Andy Seaborne <an...@apache.org>.
> I have unified the two update paths into a single path.  I also
> removed the ability to pass in an UpdateRequest into the
> UpdateEngineFactory.  In fact, I've eliminated all the places you
> could pass one in.  Since with the streaming capability, we won't be
> able to have one.  This required one change to the public API,
> GraphStore.startRequest() and .finishRequest() no longer take an
> UpdateRequest as an argument.  This shouldn't be too much burden for
> end users and implementors to adapt to.  Also there is no
> UpdateRequest available via the execution Context object (as it won't
> be known ahead of time).

Great!

We'll need some documentation to explain the migration, hopefully more 
for the pre-release cycle than the release as that's when the extenders 
should be aware.

I have fixed up the tests to work by removing the initial binding update 
tests.

Currently, 3 tests, + support method, are commented out, with a note to 
say the commented code can be removed completely post 2.10.

I have also cleaned up javadoc warnings (no such variable; class not 
imported).

> Things left to do:
>    1) Eliminate *or* deprecate the ability to pass in an initial
> binding for update requests

IMO Remove.

If the interface is changed at all, remove, to avoid two changes or, 
more likely, deprecation for years.

>    2) More javadocs around the UpdateEngine for implementors

?? Or something on the website + link in javadoc -- your call

>    3) Change the name/operation of UpdateVisitor

Could you explain that please?

>
> -Stephen
>


Re: Streaming SPARQL Updates

Posted by Stephen Allen <sa...@apache.org>.
On Fri, Feb 1, 2013 at 4:06 PM, Stephen Allen <sa...@apache.org> wrote:
> Replying to both Andy and Rob in this email, comments inline.
>
> So this part - what the proper way to modify the update engine should be -
> was one of the parts I had the most difficulty with.  So feel free to dive
> in and change it where you see it can be done better.
>
>
> On Thu, Jan 31, 2013 at 7:48 AM, Rob Vesse <rv...@yarcdata.com> wrote:
>> Awesome
>>
>> Can we have some javadoc on the new interfaces please?
>>
>> Couple of questions since I now need to at least stub these for our
>> internal implementations.
>>
>> Primarily these revolve around how one opts out of dynamic updates since
>> these won't be usable in our environment due to the peculiarities of our
>> hybrid computing architecture.
>>
>> Is it safe/recommended to just throw an exception in methods like
>> createStreaming() of UpdateEngineFactory if your implementation does not
>> support streaming?
>
> I didn't want to refactor too much, so the streaming support was
> rather grafted on (and not that clear).  Basically there are now two
> codepaths through Fuseki that depend on whether the GraphStore
> implements the Transactional interface.  I think the code may
> currently be considered kind of broken, as it assumes that if you
> implement Transactional, then you can also supply an
> UpdateEngineStreaming.  There is no fallback to a regular
> UpdateEngine.
>
> I think this is rather ugly, and we need a single codepath (based on a
> streamable API).  See below in Andy's comments.
>
>
>> Some possible slight refactors:
>>
>> UpdateEngineBase implements UpdateEngine and UpdateEngineStreaming - can I
>> suggest that we add an UpdateEngineStreamingBase which adds the
>> UpdateEngineStreaming interface so non-streaming implementations can still
>> just extend UpdateEngineBase.  Either that, or implement UpdateEngineStreaming
>> in UpdateEngineMain, not in UpdateEngineBase.
>>
>> The getInsertDataSink() and getDeleteDataSink() on UpdateVisitor seem at
>> odds with a visitor interface, would they not be better off in a separate
>> interface?  I see only one usage of these so them being placed there seems
>> somewhat arbitrary.
>
> You're correct.  And, furthermore, it is inconsistent that queries and
> insert/delete data operations are handled differently.  Ultimately
> maybe the visitor interface isn't appropriate.  Basically I see two
> options:
>
>   1) Update engines need to create and return sinks that can accept
> update operations (see UpdateSink)
>   2) Update engines accept an Iterator of update operations
> (Iterator<Update>, but with UpdateDataInsert and UpdateDataDelete
> changed to return an iterator of quads instead of a list)
>
> Option 2) is conceptually much cleaner from the update engine's point
> of view, but to implement that in ARQ would require a queue and a
> separate thread like PipedRDFStream / Iterator (this is driven by
> JavaCC having to be in charge of parsing).
>
> We currently have something like 1), but it is not very clean since
> the old visitor interface is still there.
>
>>
>> Rob
>>
>
>
>
> On Thu, Jan 31, 2013 at 9:45 AM, Andy Seaborne <an...@apache.org> wrote:
>> Great stuff and getting into 2.10 is good timing.
>>
>> I'm working through the changes so comments may come in a bit of a drip
>> feed, and hence not necessarily in importance order.
>>
>>
>> ** UpdateEngineFactory
>> (default methods in interfaces would be nice!)
>>
>> ++++
>> It would be good to hear from all storage layer implementers on this.
>> ++++
>>
>> If we are making changes, then it would be better to rework this.  I find
>> the two independent update engine concepts confusing and I think it does not
>> help storage implementers.
>
> I agree, it is not clean how it is.  We should solve this now so
> implementers only have to make one change.
>
>
>> When streaming, the ability to look at the request before deciding whether
>> to accept it is lost.  No big deal - I don't think this is ever used.
>> Looking at the request can be done by an indirection engine that decides at
>> the point of execution which real engine to pass calls to.
>
> Yes, drop that feature.
>
>> So if we go for one UpdateEngine interface, UpdateEngineFactory becomes:
>>
>>   public boolean accept(GraphStore graphStore, Context context)
>>   public UpdateEngine create(GraphStore graphStore, Context context)
>
> I like it.
>
>> Initial binding looks like a mistake because an update is multiple
>> operations and historical at best. Remove this (via a deprecation cycle?).
>> Use BIND or VALUES nowadays.
>
> Agreed.  Initial binding should be removed.  I'd like to remove it
> immediately w/o a deprecation cycle because it complicates the
> implementation, and would require storage layer implementers to
> continue to support it.  Also would require another breaking API
> change when we eventually get rid of it.
>
> It is a user facing change, but is it used much in the wild (I would
> guess probably not)?  I would guess that it is probably used a lot
> more on the query side (here we should deprecate instead of remove).
>
>> Require an update engine impl to handle any request - it can always bump to
>> the general engine.
>>
>> The core interface API is streaming and UpdateEngineBase provides stream ->
>> request that can be overridden.
>>
>> (have now seen Rob's comment suggesting interfaces that can be instanceof'ed
>> so an impl can just extend UpdateEngineBase as before).
>>
>>
>> ** UsingList:
>>
>> When applied to a whole request, this is a protocol feature.  I think it's
>> better to keep it as such.  It seems a small feature for API use that could
>> be handled via the context (it is the same as rewriting each UpdateModify
>> with no set UsingList to have the global one).
>
> Yes, I implemented this the wrong way by passing it all the way down
> into the storage engine just so it could include it when it was
> building an UpdateSink to return.  Instead a better way would be to
> simply wrap the UpdateSink that comes from the storage engine
> with one that rewrites UpdateModify objects from the parser before
> passing them on.
>
> This should be an easy fix.
>
>
>> The use case is forming a dataset for the WHERE part of an update from a
>> large collection of graphs.  The storage engine may have opinions here (e.g.
>> security, or only certain fixed shapes) so the selection process is, to my
>> mind, impl specific.
>>
>> (but UpdateWithUsing - could do with the UsingList locally internally!)
>>
>>
>> Alternative proposal:
>>   have two datasets for update - the one to update and the one to query
>>   GraphStore has a getDatasetForWHERE() operation.
>>     We need to track down the use of GraphStore for WHERE
>>     but only in the general purpose engine.
>>     Normally it's "this" - can be reset.
>>   It moves responsibility to GraphStore creation.
>>   No UsingList in the UpdateExecutionFactory API is needed
>>
>> In ARQ, you create an appropriately structured GraphStore with the graphs of
>> choice.  In Fuseki, it uses DynamicDatasets.dynamicDataset
>> (c.f. UpdateEngineWorker.processUsing) for building the view dataset.
>>
>>
>> ** DataSinks
>>
>> I didn't understand what's happening with the sinks, sorry.
>
> The sinks for insert/delete data are a place to shove the quads that
> get generated by the parser.  Instead of storing them in a List<Quad>,
> they instead get passed all the way to the update engine which gave us
> the sink.
>
>> An update request can be more than one operation, including more than one
>> INSERT DATA or DELETE DATA, so multiple sinks of both kinds per request?
>>
>> Shouldn't the sinks be per operation?  In the writer for example:
>>
>>     @Override
>>     public void visit(UpdateDataInsert update)
>>     {
>>         Iter.sendToSink(update.getQuads(),
>>                         getInsertDataSink());
>>     }
>>
>> may be called for several UpdateDataInsert's
>
> Yes, we call getInsertDataSink() per INSERT_DATA operation, and expect
> the storage engine to give us a new one each time.
>
>
> I am going to commit the change for simplifying the UsingList.  Do you
> guys have an opinion on the two options I outlined in the response to
> Rob?  I like 2) from an aesthetics point of view, but I worry a lot
> about having to spawn a separate thread for each update query.  It
> also is a big departure from the current API.  So I will look into how
> to clean up 1).
>
> -Stephen


I have unified the two update paths into a single path.  I also
removed the ability to pass in an UpdateRequest into the
UpdateEngineFactory.  In fact, I've eliminated all the places you
could pass one in.  Since with the streaming capability, we won't be
able to have one.  This required one change to the public API,
GraphStore.startRequest() and .finishRequest() no longer take an
UpdateRequest as an argument.  This shouldn't be too much burden for
end users and implementors to adapt to.  Also there is no
UpdateRequest available via the execution Context object (as it won't
be known ahead of time).


Things left to do:
  1) Eliminate *or* deprecate the ability to pass in an initial
binding for update requests
  2) More javadocs around the UpdateEngine for implementors
  3) Change the name/operation of UpdateVisitor

-Stephen

Re: Streaming SPARQL Updates

Posted by Andy Seaborne <an...@apache.org>.
On 01/02/13 21:06, Stephen Allen wrote:
> Replying to both Andy and Rob in this email, comments inline.
>
...

>> Initial binding looks like a mistake because an update is multiple
>> operations and historical at best. Remove this (via a deprecation cycle?).
>> Use BIND or VALUES nowadays.
>
> Agreed.  Initial binding should be removed.  I'd like to remove it
> immediately w/o a deprecation cycle because it complicates the
> implementation, and would require storage layer implementers to
> continue to support it.  Also would require another breaking API
> change when we eventually get rid of it.
>
> It is a user facing change, but is it used much in the wild (I would
> guess probably not)?  I would guess that it is probably used a lot
> more on the query side (here we should deprecate instead of remove).

I don't want to remove them from the query side - I don't see the need 
to.  It is a long-time feature so I prefer to leave it there.

There is overlap with ParameterizedSparqlString but that isn't a complete 
replacement.

-----

I'm seeing
UpdateEngineFactory.create(GraphStore graphStore, Binding inputBinding, 
Context context)

can inputBinding be removed as well?

>> ** DataSinks
>>
>> I didn't understand what's happening with the sinks, sorry.
>
> The sinks for insert/delete data are a place to shove the quads that
> get generated by the parser.  Instead of storing them in a List<Quad>,
> they instead get passed all the way to the update engine which gave us
> the sink.
>
>> An update request can be more than one operation, including more than one
>> INSERT DATA or DELETE DATA, so multiple sinks of both kinds per request?
>>
>> Shouldn't the sinks be per operation?  In the writer for example:
>>
>>      @Override
>>      public void visit(UpdateDataInsert update)
>>      {
>>          Iter.sendToSink(update.getQuads(),
>>                          getInsertDataSink());
>>      }
>>
>> may be called for several UpdateDataInsert's
>
> Yes, we call getInsertDataSink() per INSERT_DATA operation, and expect
> the storage engine to give us a new one each time.
>
>
> I am going to commit the change for simplifying the UsingList.  Do you
> guys have an opinion on the two options I outlined in the response to
> Rob?  I like 2) from an aesthetics point of view, but I worry a lot
> about having to spawn a separate thread for each update query.  It
> also is a big departure from the current API.  So I will look into how
> to clean up 1).

(2) is nicer at that point but the separate thread may prove to be 
troublesome - such designs have proved so in the past, when RDQL (!) was 
multithreaded.  It's OK when things are running to completion but 
clearing up on aborts always seems to end up being messy (e.g. JVMs 
don't exit because of stray threads; threads build up).

So (1) for now - I think the sink-driven design is only marginally less 
attractive.

Would calling it currentInsertDataSink() be better to capture the fact 
it isn't a getter on the UpdateVisitor?

	Andy

>
> -Stephen
>


Re: Streaming SPARQL Updates

Posted by Stephen Allen <sa...@apache.org>.
Replying to both Andy and Rob in this email, comments inline.

So this part - what the proper way to modify the update engine should be -
was one of the parts I had the most difficulty with.  So feel free to dive
in and change it where you see it can be done better.


On Thu, Jan 31, 2013 at 7:48 AM, Rob Vesse <rv...@yarcdata.com> wrote:
> Awesome
>
> Can we have some javadoc on the new interfaces please?
>
> Couple of questions since I now need to at least stub these for our
> internal implementations.
>
> Primarily these revolve around how one opts out of dynamic updates since
> these won't be usable in our environment due to the peculiarities of our
> hybrid computing architecture.
>
> Is it safe/recommended to just throw an exception in methods like
> createStreaming() of UpdateEngineFactory if your implementation does not
> support streaming?

I didn't want to refactor too much, so the streaming support was
rather grafted on (and not that clear).  Basically there are now two
codepaths through Fuseki that depend on whether the GraphStore
implements the Transactional interface.  I think the code may
currently be considered kind of broken, as it assumes that if you
implement Transactional, then you can also supply an
UpdateEngineStreaming.  There is no fallback to a regular
UpdateEngine.

I think this is rather ugly, and we need a single codepath (based on a
streamable API).  See below in Andy's comments.


> Some possible slight refactors:
>
> UpdateEngineBase implements UpdateEngine and UpdateEngineStreaming - can I
> suggest that we add an UpdateEngineStreamingBase which adds the
> UpdateEngineStreaming interface so non-streaming implementations can still
> just extend UpdateEngineBase.  Either that, or implement UpdateEngineStreaming
> in UpdateEngineMain, not in UpdateEngineBase.
>
> The getInsertDataSink() and getDeleteDataSink() on UpdateVisitor seem at
> odds with a visitor interface, would they not be better off in a separate
> interface?  I see only one usage of these so them being placed there seems
> somewhat arbitrary.

You're correct.  And, furthermore, it is inconsistent that queries and
insert/delete data operations are handled differently.  Ultimately
maybe the visitor interface isn't appropriate.  Basically I see two
options:

  1) Update engines need to create and return sinks that can accept
update operations (see UpdateSink)
  2) Update engines accept an Iterator of update operations
(Iterator<Update>, but with UpdateDataInsert and UpdateDataDelete
changed to return an iterator of quads instead of a list)

Option 2) is conceptually much cleaner from the update engine's point
of view, but to implement that in ARQ would require a queue and a
separate thread like PipedRDFStream / Iterator (this is driven by
JavaCC having to be in charge of parsing).

We currently have something like 1), but it is not very clean since
the old visitor interface is still there.
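Sketched side by side, with hypothetical signatures just to contrast the two:

    // Option 1 (push): the engine hands out a sink and the parser pushes
    // operations into it as they are parsed.
    UpdateSink sink = engine.getUpdateSink();
    sink.send(update);

    // Option 2 (pull): the engine consumes a lazy iterator fed by the
    // parser - which forces a queue plus a separate parser thread.
    engine.execute(updateIterator);   // Iterator<Update>, hypothetical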

>
> Rob
>



On Thu, Jan 31, 2013 at 9:45 AM, Andy Seaborne <an...@apache.org> wrote:
> Great stuff and getting into 2.10 is good timing.
>
> I'm working through the changes so comments may come in a bit of a drip
> feed, and hence not necessarily in importance order.
>
>
> ** UpdateEngineFactory
> (default methods in interfaces would be nice!)
>
> ++++
> It would be good to hear from all storage layer implementers on this.
> ++++
>
> If we are making changes, then it would be better to rework this.  I find
> the two independent update engine concepts confusing and I think it does not
> help storage implementers.

I agree, it is not clean how it is.  We should solve this now so
implementers only have to make one change.


> When streaming, the ability to look at the request before deciding whether
> to accept it is lost.  No big deal - I don't think this is ever used.
> Looking at the request can be done by an indirection engine that decides at
> the point of execution which real engine to pass calls to.

Yes, drop that feature.

> So if we go for one UpdateEngine interface, UpdateEngineFactory becomes:
>
>   public boolean accept(GraphStore graphStore, Context context)
>   public UpdateEngine create(GraphStore graphStore, Context context)

I like it.

> Initial binding looks like a mistake because an update is multiple
> operations and historical at best. Remove this (via a deprecation cycle?).
> Use BIND or VALUES nowadays.

Agreed.  Initial binding should be removed.  I'd like to remove it
immediately w/o a deprecation cycle because it complicates the
implementation, and would require storage layer implementers to
continue to support it.  Also would require another breaking API
change when we eventually get rid of it.

It is a user facing change, but is it used much in the wild (I would
guess probably not)?  I would guess that it is probably used a lot
more on the query side (here we should deprecate instead of remove).

> Require an update engine impl to handle any request - it can always bump to
> the general engine.
>
> The core interface API is streaming and UpdateEngineBase provides stream ->
> request that can be overridden.
>
> (have now seen Rob's comment suggesting interfaces that can be instanceof'ed
> so an impl can just extend UpdateEngineBase as before).
>
>
> ** UsingList:
>
> When applied to a whole request, this is a protocol feature.  I think it's
> better to keep it as such.  It seems a small feature for API use that could
> be handled via the context (it is the same as rewriting each UpdateModify
> with no set UsingList to have the global one).

Yes, I implemented this the wrong way by passing it all the way down
into the storage engine just so it could include it when it was
building an UpdateSink to return.  Instead a better way would be to
simply wrap the UpdateSink that comes from the storage engine
with one that rewrites UpdateModify objects from the parser before
passing them on.

This should be an easy fix.
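i.e., a small decorator along these lines (a sketch; the class name and the
addUsing() helper are hypothetical):

    // Sketch of the wrapping fix: inject the protocol-level UsingList into
    // each UpdateModify before it reaches the storage engine's sink.
    class UsingListUpdateSink implements Sink<Update>
    {
        private final Sink<Update> engineSink;   // sink from the storage engine
        private final UsingList usingList;       // from the protocol request

        UsingListUpdateSink(Sink<Update> engineSink, UsingList usingList)
        { this.engineSink = engineSink; this.usingList = usingList; }

        @Override public void send(Update update)
        {
            if (update instanceof UpdateModify)
                addUsing((UpdateModify)update, usingList);   // hypothetical rewrite
            engineSink.send(update);
        }

        @Override public void flush() { engineSink.flush(); }
        @Override public void close() { engineSink.close(); }
    }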


> The use case is forming a dataset for the WHERE part of an update from a
> large collection of graphs.  The storage engine may have opinions here (e.g.
> security, or only certain fixed shapes) so the selection process is, to my
> mind, impl specific.
>
> (but UpdateWithUsing - could do with the UsingList locally internally!)
>
>
> Alternative proposal:
>   have two datasets for update - the one to update and the one to query
>   GraphStore has a getDatasetForWHERE() operation.
>     We need to track down the use of GraphStore for WHERE
>     but only in the general purpose engine.
>     Normally it's "this" - can be reset.
>   It moves responsibility to GraphStore creation.
>   No UsingList in the UpdateExecutionFactory API is needed
>
> In ARQ, you create an appropriately structured GraphStore with the graphs of
> choice.  In Fuseki, it uses DynamicDatasets.dynamicDataset
> (c.f. UpdateEngineWorker.processUsing) for building the view dataset.
>
>
> ** DataSinks
>
> I didn't understand what's happening with the sinks, sorry.

The sinks for insert/delete data are a place to shove the quads that
get generated by the parser.  Instead of storing them in a List<Quad>,
they instead get passed all the way to the update engine which gave us
the sink.

> An update request can be more than one operation, including more than one
> INSERT DATA or DELETE DATA, so multiple sinks of both kinds per request?
>
> Shouldn't the sinks be per operation?  In the writer for example:
>
>     @Override
>     public void visit(UpdateDataInsert update)
>     {
>         Iter.sendToSink(update.getQuads(),
>                         getInsertDataSink());
>     }
>
> may be called for several UpdateDataInsert's

Yes, we call getInsertDataSink() per INSERT_DATA operation, and expect
the storage engine to give us a new one each time.
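On the engine side the pattern is then roughly as follows (a sketch;
BulkInserter and startBulkInsert() are hypothetical stand-ins for whatever
the storage layer actually does):

    // Sketch: hand back a fresh quad sink per INSERT DATA operation, so
    // each operation's quads stream straight into storage.
    public Sink<Quad> getInsertDataSink()
    {
        final BulkInserter inserter = storage.startBulkInsert();   // hypothetical
        return new Sink<Quad>()
        {
            @Override public void send(Quad quad) { inserter.add(quad); }
            @Override public void flush() {}
            @Override public void close() { inserter.finish(); }   // operation done
        };
    }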


I am going to commit the change for simplifying the UsingList.  Do you
guys have an opinion on the two options I outlined in the response to
Rob?  I like 2) from an aesthetics point of view, but I worry a lot
about having to spawn a separate thread for each update query.  It
also is a big departure from the current API.  So I will look into how
to clean up 1).

-Stephen

Re: Streaming SPARQL Updates

Posted by Andy Seaborne <an...@apache.org>.
Great stuff and getting into 2.10 is good timing.

I'm working through the changes so comments may come in a bit of a drip 
feed, and hence not necessarily in importance order.


** UpdateEngineFactory
(default methods in interfaces would be nice!)

++++
It would be good to hear from all storage layer implementers on this.
++++

If we are making changes, then it would be better to rework this.  I 
find the two independent update engine concepts confusing and I think it 
does not help storage implementers.

When streaming, the ability to look at the request before deciding 
whether to accept it is lost.  No big deal - I don't think this is ever 
used.  Looking at the request can be done by an indirection engine that 
decides at the point of execution which real engine to pass calls to.

So if we go for one UpdateEngine interface, UpdateEngineFactory becomes:

   public boolean accept(GraphStore graphStore, Context context)
   public UpdateEngine create(GraphStore graphStore, Context context)

Initial binding looks like a mistake because an update is multiple 
operations and historical at best. Remove this (via a deprecation 
cycle?).  Use BIND or VALUES nowadays.

Require an update engine impl to handle any request - it can always bump 
to the general engine.

The core interface API is streaming and UpdateEngineBase provides stream 
-> request that can be overridden.

(have now seen Rob's comment suggesting interfaces that can be 
instanceof'ed so an impl can just extend UpdateEngineBase as before).


** UsingList:

When applied to a whole request, this is a protocol feature.  I think 
it's better to keep it as such.  It seems a small feature for API use 
thet could be handled via the context (it is the same as rewriting each 
UpdateModify with no set UsingList to have the global one).

The use case is forming a dataset for the WHERE part of an update from a 
large collection of graphs.  The storage engine may have opinions here 
(e.g. security, or only certain fixed shapes) so the selection process 
is, to my mind, impl specific.

(but UpdateWithUsing - could do with the UsingList locally internally!)


Alternative proposal:
   have two datasets for update - the one to update and the one to query
   GraphStore has a getDatasetForWHERE() operation.
     We need to track down the use of GraphStore for WHERE
     but only in the general purpose engine.
     Normally it's "this" - can be reset.
   It moves responsibility to GraphStore creation.
   No UsingList in the UpdateExecutionFactory API is needed

In ARQ, you create an appropriately structured GraphStore with the 
graphs of choice.  In Fuseki, it uses DynamicDatasets.dynamicDataset
(c.f. UpdateEngineWorker.processUsing) for building the view dataset.


** DataSinks

I didn't understand what's happening with the sinks, sorry.

An update request can be more than one operation, including more than 
one INSERT DATA or DELETE DATA, so multiple sinks of both kinds per request?

Shouldn't the sinks be per operation?  In the writer for example:

     @Override
     public void visit(UpdateDataInsert update)
     {
         Iter.sendToSink(update.getQuads(),
                         getInsertDataSink());
     }

may be called for several UpdateDataInsert's

Re: Streaming SPARQL Updates

Posted by Rob Vesse <rv...@yarcdata.com>.
Awesome

Can we have some javadoc on the new interfaces please?

Couple of questions since I now need to at least stub these for our
internal implementations.

Primarily these revolve around how one opts out of dynamic updates since
these won't be usable in our environment due to the peculiarities of our
hybrid computing architecture.

Is it safe/recommended to just throw an exception in methods like
createStreaming() of UpdateEngineFactory if your implementation does not
support streaming?

Some possible slight refactors:

UpdateEngineBase implements UpdateEngine and UpdateEngineStreaming - can I
suggest that we add an UpdateEngineStreamingBase which adds the
UpdateEngineStreaming interface so non-streaming implementations can still
just extend UpdateEngineBase.  Either that, or implement UpdateEngineStreaming
in UpdateEngineMain, not in UpdateEngineBase.

The getInsertDataSink() and getDeleteDataSink() on UpdateVisitor seem at
odds with a visitor interface, would they not be better off in a separate
interface?  I see only one usage of these so them being placed there seems
somewhat arbitrary.

Rob



On 1/31/13 4:33 AM, "Stephen Allen" <sa...@apache.org> wrote:

>All,
>
>I've finally completed the work I was doing on streaming SPARQL Updates.
>Yay!
>
>There should be no user visible changes from this update, but there
>will be some changes for 3rd party storage engines.  Specifically they
>need to implement a couple of new methods in their UpdateEngineFactory
>implementation.  See the changes to
>jena-tdb/com.hp.hpl.jena.tdb.modify.UpdateEngineTDB in revision
>1440841 for what probably needs to be done for your engine.
>
>What this change enables:
>  1) Update requests that can contain an unlimited number of queries
>  2) Update queries that can contain an unlimited number of quads
>inside of INSERT DATA and DELETE DATA queries
>  3) If the underlying storage engine is transactional, then Fuseki
>can now stream the updates directly to the engine, theoretically
>allowing an unlimited number of quads to be inserted in a single
>query*
>
>To exploit these new capabilities to the fullest, you probably need to
>use jena-client, which is able to create streaming update requests both
>for local datasets and remote HTTP endpoints.
>
>Please let me know if you notice any problems.  As this is a pretty
>big feature change, I would appreciate test reports if you have time.
>
>-Stephen
>
>
>* I believe TDB still buffers uncommitted changes in memory, so there
>is still a limit here (but it is a lot better than before, which
>stored multiple copies of the update request in memory).  You can test
>this unlimited feature by using GraphStoreNull, which will just throw
>away new quads (use the assembler attached to JENA-330 to stand up a
>Fuseki instance using GraphStoreNull).