Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2013/06/18 16:26:22 UTC

RDF Delta - recording changes to RDF Datasets

I started writing up a format for transferring changes between dataset 
copies (copies in time and in location).

https://cwiki.apache.org/confluence/display/JENA/RDF+Delta

Still rough and ready but I hope it gives a general impression of the 
format and usage.

Comments, thoughts, discussion here on dev@ please.

	Andy
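As a rough illustration of the consumer side, a patch of this kind can be applied row by row. The sketch below assumes "A" (add) and "D" (delete) action codes as discussed later in this thread, and treats the rest of each row as an opaque quad; the actual grammar is defined by the draft, not by this sketch.

```python
# Illustrative sketch only: apply an RDF-Patch-style change log to an
# in-memory dataset modelled as a set of quad tuples.  The row shape
# ("A"/"D" + whitespace-separated terms + trailing dot) is assumed from
# the discussion, not taken from the draft itself.

def apply_patch(rows, dataset):
    """Apply each change row to `dataset` (a set of tuples), in order."""
    for row in rows:
        row = row.strip()
        if not row or row.startswith("#"):
            continue                        # skip blank and comment rows
        action, rest = row.split(None, 1)
        quad = tuple(rest.rstrip(" .").split())   # naive tokenisation
        if action == "A":
            dataset.add(quad)       # re-adding is a no-op: set semantics
        elif action == "D":
            dataset.discard(quad)   # deleting an absent quad is a no-op
    return dataset
```

Because graphs and datasets are sets, applying an "A" row twice, or a "D" row for an absent quad, changes nothing - which is why a patch has to record real changes before it can be run in reverse.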

Re: RDF Patch

Posted by Andy Seaborne <an...@apache.org>.

On 21/06/13 18:31, Rob Vesse wrote:
> I went ahead and submitted a pull request for various
> typographical/editorial tweaks

I've added a use case section that talks about HTTP PATCH.  Not really a 
"use" (why did the app call PATCH?), but looking back it's good to point 
out the role the format could play in PATCH.

> I also went ahead and renamed Minimise Actions to Canonical Patches as
> that makes a much clearer name for it, not sure this is quite the correct
> terminology though

That's a better name ...

Maybe just call it "reversible", which is the main point.

Or a "Strong RDF Patch"

	Andy
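Whatever the name, the property being pointed at is mechanical: a patch that records only real changes can be undone by swapping adds and deletes and replaying the rows last-to-first. A sketch, assuming "A"/"D" action codes on each row (illustrative, not the draft's grammar):

```python
def reverse_patch(rows):
    """Invert a reversible ("canonical") patch: swap A and D, reverse order.

    Only sound when every row was a real change; if an A row re-added an
    already-present quad, the corresponding D in the reversed patch would
    wrongly remove it.
    """
    flipped = {"A": "D", "D": "A"}
    out = []
    for row in reversed(rows):
        action, rest = row.split(None, 1)
        out.append(flipped[action] + " " + rest)
    return out
```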

>
> Rob
>
>
>
> On 6/20/13 3:38 PM, "Rob Vesse" <rv...@cray.com> wrote:
>
>> I did read some of the working group discussions around the patch format
>> and some of the stuff they were discussing made me want to cry at the
>> horrific syntax abuses some people were proposing to make
>>
>> Steering them towards something that is simpler like RDF patch would seem
>> a good idea
>>
>> Rob
>>
>>
>>
>> On 6/20/13 3:03 PM, "Andy Seaborne" <an...@apache.org> wrote:
>>
>>>
>>> BTW, I got a ping from LDP-WG about a patch format.  That WG wants
>>> something sub-SPARQL; this may be a useful input.
>>>
>>>
>>> I've looked before at RDF-encoded versions (Talis ChangeSets, using
>>> TriG) but without further syntax or processing rules, they don't stream
>>> and it needs a whole request read in before processing.  That's a severe
>>> limitation.
>>>
>>> Example:
>>>
>>> @prefix diff: <http://example/diff#> .
>>> @prefix :     <http://example/data#> .
>>>
>>> <#g2> { :s :p 456  }
>>> <#g1> { :s :p 123  }
>>>
>>> <#g1> { :x :q "foo" }
>>>
>>> { <> diff:delete <#g1> ;
>>>       diff:insert <#g2> .
>>> }
>>>
>>> with the manifest default graph last, you can't tell anything about
>>> <#g1> or <#g2> so the best I can imagine is to stash them away somewhere.
>>>
>>> And does not cope with datasets (a graph-grouped complex manifest would
>>> work but then any simplicity is lost and production of such patches is
>>> looking a bit troublesome)
>>>
>>> And then there's blank nodes.
>>>
>>> Restricted SPARQL Update (INSERT DATA, DELETE DATA) sort of works ...
>>> except bNodes.  An advantage is adding naturally "DROP GRAPH" and
>>> "DELETE WHERE".
>>>
>>> 	Andy
>>>
>>
>


Re: RDF Patch

Posted by Rob Vesse <rv...@yarcdata.com>.
I went ahead and submitted a pull request for various
typographical/editorial tweaks

I also went ahead and renamed Minimise Actions to Canonical Patches as
that makes a much clearer name for it, not sure this is quite the correct
terminology though

Rob



On 6/20/13 3:38 PM, "Rob Vesse" <rv...@cray.com> wrote:

>I did read some of the working group discussions around the patch format
>and some of the stuff they were discussing made me want to cry at the
>horrific syntax abuses some people were proposing to make
>
>Steering them towards something that is simpler like RDF patch would seem
>a good idea
>
>Rob
>
>
>
>On 6/20/13 3:03 PM, "Andy Seaborne" <an...@apache.org> wrote:
>
>>
>>BTW, I got a ping from LDP-WG about a patch format.  That WG wants
>>something sub-SPARQL; this may be a useful input.
>>
>>
>>I've looked before at RDF-encoded versions (Talis ChangeSets, using
>>TriG) but without further syntax or processing rules, they don't stream
>>and it needs a whole request read in before processing.  That's a severe
>>limitation.
>>
>>Example:
>>
>>@prefix diff: <http://example/diff#> .
>>@prefix :     <http://example/data#> .
>>
>><#g2> { :s :p 456  }
>><#g1> { :s :p 123  }
>>
>><#g1> { :x :q "foo" }
>>
>>{ <> diff:delete <#g1> ;
>>      diff:insert <#g2> .
>>}
>>
>>with the manifest default graph last, you can't tell anything about
>><#g1> or <#g2> so the best I can imagine is to stash them away somewhere.
>>
>>And does not cope with datasets (a graph-grouped complex manifest would
>>work but then any simplicity is lost and production of such patches is
>>looking a bit troublesome)
>>
>>And then there's blank nodes.
>>
>Restricted SPARQL Update (INSERT DATA, DELETE DATA) sort of works ...
>>except bNodes.  An advantage is adding naturally "DROP GRAPH" and
>>"DELETE WHERE".
>>
>>	Andy
>>
>


Re: RDF Patch

Posted by Rob Vesse <rv...@yarcdata.com>.
I did read some of the working group discussions around the patch format
and some of the stuff they were discussing made me want to cry at the
horrific syntax abuses some people were proposing to make

Steering them towards something that is simpler like RDF patch would seem
a good idea

Rob



On 6/20/13 3:03 PM, "Andy Seaborne" <an...@apache.org> wrote:

>
>BTW, I got a ping from LDP-WG about a patch format.  That WG wants
>something sub-SPARQL; this may be a useful input.
>
>
>I've looked before at RDF-encoded versions (Talis ChangeSets, using
>TriG) but without further syntax or processing rules, they don't stream
>and it needs a whole request read in before processing.  That's a severe
>limitation.
>
>Example:
>
>@prefix diff: <http://example/diff#> .
>@prefix :     <http://example/data#> .
>
><#g2> { :s :p 456  }
><#g1> { :s :p 123  }
>
><#g1> { :x :q "foo" }
>
>{ <> diff:delete <#g1> ;
>      diff:insert <#g2> .
>}
>
>with the manifest default graph last, you can't tell anything about
><#g1> or <#g2> so the best I can imagine is to stash them away somewhere.
>
>And does not cope with datasets (a graph-grouped complex manifest would
>work but then any simplicity is lost and production of such patches is
>looking a bit troublesome)
>
>And then there's blank nodes.
>
>Restricted SPARQL Update (INSERT DATA, DELETE DATA) sort of works ...
>except bNodes.  An advantage is adding naturally "DROP GRAPH" and
>"DELETE WHERE".
>
>	Andy
>


Re: RDF Patch

Posted by Andy Seaborne <an...@apache.org>.
BTW, I got a ping from LDP-WG about a patch format.  That WG wants 
something sub-SPARQL; this may be a useful input.


I've looked before at RDF-encoded versions (Talis ChangeSets, using 
TriG) but without further syntax or processing rules, they don't stream 
and it needs a whole request read in before processing.  That's a severe 
limitation.

Example:

@prefix diff: <http://example/diff#> .
@prefix :     <http://example/data#> .

<#g2> { :s :p 456  }
<#g1> { :s :p 123  }

<#g1> { :x :q "foo" }

{ <> diff:delete <#g1> ;
      diff:insert <#g2> .
}

with the manifest default graph last, you can't tell anything about 
<#g1> or <#g2> so the best I can imagine is to stash them away somewhere.

And does not cope with datasets (a graph-grouped complex manifest would 
work but then any simplicity is lost and production of such patches is 
looking a bit troublesome)

And then there's blank nodes.

Restricted SPARQL Update (INSERT DATA, DELETE DATA) sort of works ... 
except bNodes.  An advantage is adding naturally "DROP GRAPH" and 
"DELETE WHERE".

	Andy
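For comparison, the change intended by the TriG example above comes out quite compactly in that restricted SPARQL Update subset (DELETE DATA / INSERT DATA, ground terms only). This is one plausible reading of the manifest, shown for illustration:

```sparql
PREFIX : <http://example/data#>

DELETE DATA { GRAPH <#g1> { :s :p 123 } } ;
INSERT DATA { GRAPH <#g1> { :s :p 456 } }
```

The bNode problem remains, though: DELETE DATA disallows blank nodes, so this subset can only describe changes to ground triples.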


Re: RDF Patch

Posted by Andy Seaborne <an...@apache.org>.
On 20/06/13 22:33, Rob Vesse wrote:
> You could add a Binary Serialization to the TODO list

OK - I didn't want to imply anything at the moment but since you mention 
it ... done!

	Andy

>
> Rob
>
>
>
> On 6/20/13 1:46 PM, "Andy Seaborne" <an...@apache.org> wrote:
>
>> On 20/06/13 20:39, Stephen Allen wrote:
>>> On Thu, Jun 20, 2013 at 3:15 PM, Andy Seaborne <an...@apache.org> wrote:
>>>
>>>> Moved:
>>>>
>>>> http://afs.github.io/rdf-patch/
>>>>
>>>> and ReSpec'ed.
>>>>
>>>>
>>> Another idea.  Maybe a header of some type to record various bits of
>>> metadata.  One important one might be whether or not the file was in
>>> "minimized form".  Presumably, you'd want this to be RDF as well.
>>> Example
>>> ("H" stands for header):
>>>
>>> H _:b a <http://jena.apache.org/2013/06/rdf-patch#Patch> .
>>> H _:b rdfs:comment "Generated by Jena Fuseki" .
>>> H _:b dc:date "2013-06-20"^^xsd:date .
>>> H _:b <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
>>>
>>> etc.
>>>
>>> I think you'd only allow H rows to appear before any A or D rows appear
>>> (but allow @prefix statements before it).
>>>
>>> I don't know exactly what you'd want to put in an ontology like this,
>>> but
>>> it may be useful.  Also I used a blank node as the subject in my
>>> example,
>>> but perhaps a fixed resource would be better.
>>>
>>> -Stephen
>>>
>>
>> Good points - need to mark whether it's reversible or not (minimal isn't
>> quite the right word for the required characteristic).
>>
>> We could break the no relative URI rule and use <> for "this document" -
>> the bNodes are tricky because the label is interpreted not as file
>> scoped, but something to name real store bnodes.  Flipping label scopes
>> might get confusing!
>>
>> Maybe a format specific syntax (RDF Patch isn't RDF) and the parser
>> generates RDF from it.  A link to a general file is always possible.
>>
>> H  rdf:type <http://jena.apache.org/2013/06/rdf-patch#Patch> .
>> H  rdfs:comment "Generated by Jena Fuseki" .
>> H  dc:date "2013-06-20"^^xsd:date .
>> H  <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
>> H  link <http://example/more/information.ttl> .
>>
>> 	Andy
>>
>>
>


Re: RDF Patch

Posted by Rob Vesse <rv...@yarcdata.com>.
You could add a Binary Serialization to the TODO list

Rob



On 6/20/13 1:46 PM, "Andy Seaborne" <an...@apache.org> wrote:

>On 20/06/13 20:39, Stephen Allen wrote:
>> On Thu, Jun 20, 2013 at 3:15 PM, Andy Seaborne <an...@apache.org> wrote:
>>
>>> Moved:
>>>
>>> http://afs.github.io/rdf-patch/
>>>
>>> and ReSpec'ed.
>>>
>>>
>> Another idea.  Maybe a header of some type to record various bits of
>> metadata.  One important one might be whether or not the file was in
>> "minimized form".  Presumably, you'd want this to be RDF as well.
>>Example
>> ("H" stands for header):
>>
>> H _:b a <http://jena.apache.org/2013/06/rdf-patch#Patch> .
>> H _:b rdfs:comment "Generated by Jena Fuseki" .
>> H _:b dc:date "2013-06-20"^^xsd:date .
>> H _:b <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
>>
>> etc.
>>
>> I think you'd only allow H rows to appear before any A or D rows appear
>> (but allow @prefix statements before it).
>>
>> I don't know exactly what you'd want to put in an ontology like this,
>>but
>> it may be useful.  Also I used a blank node as the subject in my
>>example,
>> but perhaps a fixed resource would be better.
>>
>> -Stephen
>>
>
>Good points - need to mark whether it's reversible or not (minimal isn't
>quite the right word for the required characteristic).
>
>We could break the no relative URI rule and use <> for "this document" -
>the bNodes are tricky because the label is interpreted not as file
>scoped, but something to name real store bnodes.  Flipping label scopes
>might get confusing!
>
>Maybe a format specific syntax (RDF Patch isn't RDF) and the parser
>generates RDF from it.  A link to a general file is always possible.
>
>H  rdf:type <http://jena.apache.org/2013/06/rdf-patch#Patch> .
>H  rdfs:comment "Generated by Jena Fuseki" .
>H  dc:date "2013-06-20"^^xsd:date .
>H  <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
>H  link <http://example/more/information.ttl> .
>
>	Andy
>
>


Re: RDF Patch

Posted by Andy Seaborne <an...@apache.org>.
On 20/06/13 20:39, Stephen Allen wrote:
> On Thu, Jun 20, 2013 at 3:15 PM, Andy Seaborne <an...@apache.org> wrote:
>
>> Moved:
>>
>> http://afs.github.io/rdf-patch/
>>
>> and ReSpec'ed.
>>
>>
> Another idea.  Maybe a header of some type to record various bits of
> metadata.  One important one might be whether or not the file was in
> "minimized form".  Presumably, you'd want this to be RDF as well.  Example
> ("H" stands for header):
>
> H _:b a <http://jena.apache.org/2013/06/rdf-patch#Patch> .
> H _:b rdfs:comment "Generated by Jena Fuseki" .
> H _:b dc:date "2013-06-20"^^xsd:date .
> H _:b <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
>
> etc.
>
> I think you'd only allow H rows to appear before any A or D rows appear
> (but allow @prefix statements before it).
>
> I don't know exactly what you'd want to put in an ontology like this, but
> it may be useful.  Also I used a blank node as the subject in my example,
> but perhaps a fixed resource would be better.
>
> -Stephen
>

Good points - need to mark whether it's reversible or not (minimal isn't 
quite the right word for the required characteristic).

We could break the no relative URI rule and use <> for "this document" - 
the bNodes are tricky because the label is interpreted not as file 
scoped, but something to name real store bnodes.  Flipping label scopes 
might get confusing!

Maybe a format specific syntax (RDF Patch isn't RDF) and the parser 
generates RDF from it.  A link to a general file is always possible.

H  rdf:type <http://jena.apache.org/2013/06/rdf-patch#Patch> .
H  rdfs:comment "Generated by Jena Fuseki" .
H  dc:date "2013-06-20"^^xsd:date .
H  <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .
H  link <http://example/more/information.ttl> .

	Andy
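The placement rule - H rows only before the first A or D row - is a single-pass check. A sketch (markers as proposed above, everything else about the row grammar assumed):

```python
def check_header_placement(rows):
    """Return True iff every 'H' row precedes the first 'A' or 'D' row.

    '@prefix' rows (and anything else) are ignored here, matching the
    suggestion that prefixes may appear before the header block too.
    """
    seen_data = False
    for row in rows:
        row = row.lstrip()
        if not row:
            continue
        marker = row.split(None, 1)[0]
        if marker in ("A", "D"):
            seen_data = True
        elif marker == "H" and seen_data:
            return False        # header row after data has started
    return True
```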



Re: RDF Patch

Posted by Stephen Allen <sa...@apache.org>.
On Thu, Jun 20, 2013 at 3:15 PM, Andy Seaborne <an...@apache.org> wrote:

> Moved:
>
> http://afs.github.io/rdf-patch/
>
> and ReSpec'ed.
>
>
Another idea.  Maybe a header of some type to record various bits of
metadata.  One important one might be whether or not the file was in
"minimized form".  Presumably, you'd want this to be RDF as well.  Example
("H" stands for header):

H _:b a <http://jena.apache.org/2013/06/rdf-patch#Patch> .
H _:b rdfs:comment "Generated by Jena Fuseki" .
H _:b dc:date "2013-06-20"^^xsd:date .
H _:b <http://jena.apache.org/2013/06/rdf-patch#minimizedForm> true .

etc.

I think you'd only allow H rows to appear before any A or D rows appear
(but allow @prefix statements before it).

I don't know exactly what you'd want to put in an ontology like this, but
it may be useful.  Also I used a blank node as the subject in my example,
but perhaps a fixed resource would be better.

-Stephen

RDF Patch

Posted by Andy Seaborne <an...@apache.org>.
Moved:

http://afs.github.io/rdf-patch/

and ReSpec'ed.

	Andy

Re: RDF Delta - recording changes to RDF Datasets

Posted by Simon Helsen <sh...@ca.ibm.com>.
as an aside, I always think of G in the first spot as well and I have 
always found the N-Quads format very counterintuitive. Andy, I suspect that is 
because in our heads, we group triples per graph, i.e. we lexicographically 
sort per graph, then the rest (very much like a dictionary). In N-Quads, if 
you sort by graph, everything is reversed. 

On the issue, I also favor using the standard, i.e. N-Quads

Simon





From:
Andy Seaborne <an...@apache.org>
To:
dev@jena.apache.org, 
Date:
06/20/2013 09:17 AM
Subject:
Re: RDF Delta - recording changes to RDF Datasets



I think the use of N-Quads order is better, given N-Quads exists.  I 
always think of quads as G-S-P-O (but I have no idea why!) and it just 
got written that way because.

The format does really need to be parsed in complete rows before 
deciding what to do with a row, so, very large literals (VLL) aside, 
batching by graph isn't greatly affected.

VLLs (very large literals) of themselves could do with special handling.
But at the same time, I'd like to assume subjects-as-literals which 
means they are not necessarily in the final object slot in GSPO order 
when you could imagine special handling enabled by G-first.

Added a comments/todo section to not lose any of these points.

     Andy


On 19/06/13 00:56, Rob Vesse wrote:
> The format already allows arbitrarily sized tuples (well in the current
> form it is capped at 255 columns per tuple) though it assumes that this
> will be used to convey SPARQL results and thus currently requires that
> column headers be provided.  Both those restrictions would be fairly easy
> to remove.
>
> I will raise the issue of open sourcing with management again and see if I
> get any traction.
>
> On the subject of column ordering I can see benefits of putting the <g>
> field first in that it may make it easier to batch operations on a single
> graph though I don't think putting it at the end to align with NQuads
> precludes this you just require slightly more lookahead to determine
> whether to continue adding statements to your batch.
>
> Rob
>
>
>
> On 6/18/13 4:41 PM, "Stephen Allen" <sa...@apache.org> wrote:
>
>> On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne <an...@apache.org> wrote:
>>
>>> On 18/06/13 22:13, Rob Vesse wrote:
>>>
>>>> Hey Andy
>>>>
>>>
>>> Hi Rob - thanks for the comments - really appreciate feedback -
>>>
>>>
>>>
>>>> The basic approach looks sound and I like the simple text based format,
>>>> see my notes later on about maybe having a binary serialization as
>>>> well.
>>>>
>>>
>>> A binary form would be excellent for this and for NT and NQ.  One of the
>>> speed limitations is parsing and Turtle is slower than NT (this isn't
>>> just
>>> a Jena effect).  gzip is neutral for reading but slows down writing.
>>> So a
>>> fast file format would be quite useful to add to the tool box.
>>>
>>>
>>>   How do you envisage incremental backups being implemented in practice,
>>> you
>>>> suggest in the document that you would take a full RDF dump and then
>>>> compute the RDF delta from a previous backup.  Talking from the
>>>> experience
>>>> of having done this as part of one of my experiments in my PhD this
>>>> can be
>>>> very complex and time consuming to do especially if you need to take
>>>> care
>>>> of BNode isomorphism.  I assume from some of the other discussion on
>>>> BNodes that you assume that IDs will remain stable across dumps, thus
>>>> there is an implicit requirement here that the database be able to dump
>>>> RDF using consistent BNode IDs (either internal IDs or some stable
>>>> round
>>>> trippable IDs).  Taking ARQ as an example the existing NQuads/TriG
>>>> writers
>>>> do not do this so there would need to be an option for those writers
>>>> to be
>>>> able to support this.
>>>>
>>>
>>> Shh, don't tell anyone but n-quads and n-triples outputs do dump
>>> recoverable bNode labels :-)  TriG and Turtle do not - they try to be
>>> pretty.  The readers need a  tweak to recover them but the label->Node
>>> code
>>> has an option for various label policies and recover id from label is
>>> one
>>> of them.  This is not exposed formally - it's strictly illegal for RDF
>>> syntaxes.  Or use <_:label> URIs.
>>>
>>> I have prototyped a wrapper dataset that records changes as they happen
>>> driven off add(quad) and delete(quad).  This produces the RDF Delta
>>> (sp!)
>>> form so couple to xtn and you can have a "live incremental backup".
>>>
>>> A strict after-the-event delta would be prohibitively expensive.
>>>
>>>
>>>   Even without any concerns of BNode isomorphism comparing two RDF dumps
>>> to
>>>> create a delta could be a potentially very time consuming operation and
>>>> recording the deltas as changes happen may be far more efficient.  Of
>>>> course depending on the exact use case the RDF dump and compute delta
>>>> approach may be acceptable.
>>>>
>>>
>>> It isn't a delta in the set theory A\B sense - nor is it a diff (it's
>>> not
>>> reversible without the additional condition).  "delta" and "diff" are
>>> both
>>> names I've toyed with - "RDF changes" might better capture the idea.  Or
>>> "RDF Changes Log".
>>>
>>>
>>>   My main criticism is on the "Minimise actions" section, there needs to
>>> be
>>>> a more solid clarification of definitions and when minimization can and
>>>> should happen.
>>>>
>>>
>>> Yes - it isn't as well covered in the doc.
>>>
>>> Logically - or generally - in the event-generating dataset wrapper:
>>>
>>>          if ( contains(g,s,p,o) ) {
>>>              record(QuadAction.NO_ADD,g,s,p,o) ; // No action.
>>>              return ;
>>>          }
>>>
>>>          add(g,s,p,o) ;
>>>          record(QuadAction.ADD,g,s,p,o) ;        // Action.
>>>
>>> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder
>>>
>>> but implementations like TDB can do it without the contains() as the
>>> indexes already return true/false for whether a change occurred or not.
>>>
>>>
>>>
>>>> For example:
>>>>
>>>> "When written in minimise form the RDF Delta can be run backwards, to
>>>> undo
>>>> a change. This only works when real changes are recorded because
>>>> otherwise
>>>> knowing a triple is added does not mean it was not there before."
>>>>
>>>> While I agree it is necessary to record real changes for deltas to be
>>>> reverse applied I'm not convinced they have to be in minimized form (at
>>>> least based on how the definition of minimized form reads right now),
>>>> if
>>>> only real changes are recorded then deltas will be in a minimal form.
>>>>
>>>> Yet it is not entirely clear by your definition the following delta
>>>> would
>>>> be considered minimal:
>>>>
>>>> A <http://s> <http://p> <http://o>
>>>> R <http://s> <http://p> <http://o>
>>>> A <http://s> <http://p> <http://o>
>>>>
>>>
>>> If the dataset did not originally contain <http://s> <http://p>
>>> <http://o>
>>> then that is minimal.  Each row makes a real change; it's the fact that
>>> graphs/datasets are sets of triples/quads that the real change is needed.
>>>
>>>
>>>   I'm assuming that your intention was that such deltas should not be
>>>> minimized but perhaps this needs to be more clear in the document.
>>>>
>>>
>>> There is no reason not to allow the redundant first two A-D to be
>>> removed
>>> but it's not required.
>>>
>>>
>>>   On the topic of related work:
>>>>
>>>> I think I may have mentioned previously that I've done some research
>>>> work
>>>> internally here at YarcData on a general purpose binary serialization
>>>> for
>>>> Triples, Quads and Tuples which likely could be fairly trivially
>>>> extended
>>>> to carry a binary encoding of the deltas as well which may save space.
>>>> For ball park comparison purposes compression is roughly equivalent to
>>>> GZipping raw NTriples with the key advantage being that the format is
>>>> significantly faster to process even in its current prototype single
>>>> threaded implementation (the design was written to take advantage of
>>>> parallelism).  There are a bunch of further optimizations that I had
>>>> ideas
>>>> for that I never got as far as implementing because of lack of
>>>> management
>>>> support for the concept.
>>>>
>>>
>>> My experience is that the cost of writing gzip is an appreciable
>>> slowdown.
>>>   If your binary form removes that cost it would help full backups quite a lot.
>>>
>>>
>>>   There has been some discussion of open sourcing this work (likely as a
>>>> contributed Experimental module to Jena) so that it could be developed
>>>> outside of the company, if this sounds like it may be of interest I
>>>> will
>>>> broach the subject with relevant management again and see whether this
>>>> can
>>>> happen in the near future.
>>>>
>>>
>>> Please do.  I find the style of having a text form and a binary form
>>> makes
>>> system building easier.  Text files to debug; binary for production.
>>>
>>> We can add e.g. .ntz and .nqz to the known formats -- modules can add
>>> language, syntaxes, parsers and writers.  The JSON-LD module does, so I
>>> know it does work from outside; all the built-in ones actually register
>>> themselves the same way and have no specials.
>>>
>>>
>> Rob:
>>
>> I would definitely be interested in a binary format for both triples and
>> quads.  In fact, if it could be generalized to handle arbitrarily sized
>> RDF
>> tuples, that would be even better.  I would like to replace the current
>> text-based solution used for the spill-to-disk functionality.
>>
>> Andy:
>> I like what you've done and think it could be very useful.  One
>> suggestion:
>> the order of the tuples should be <s> <p> <o> <g> to match the N-Quads
>> format [1].
>>
>>
>> -Stephen
>>
>> [1] http://www.w3.org/TR/n-quads/
>




Re: RDF Delta - recording changes to RDF Datasets

Posted by Andy Seaborne <an...@apache.org>.
I think the use of N-Quads order is better, given N-Quads exists.  I 
always think of quads as G-S-P-O (but I have no idea why!) and it just 
got written that way because.

The format does really need to be parsed in complete rows before 
deciding what to do with a row, so, very large literals (VLL) aside, 
batching by graph isn't greatly affected.

VLLs (very large literals) of themselves could do with special handling.
But at the same time, I'd like to assume subjects-as-literals which 
means they are not necessarily in the final object slot in GSPO order 
when you could imagine special handling enabled by G-first.

Added a comments/todo section to not lose any of these points.

     Andy
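The lookahead Rob describes below is small in practice: batching consecutive rows by graph needs only the current row. A sketch over (s, p, o, g) tuples in N-Quads slot order (the batching policy itself is illustrative):

```python
from itertools import groupby

def batches_by_graph(quads):
    """Yield (graph, [quads]) batches from an iterable of (s, p, o, g) tuples.

    groupby only merges *consecutive* rows with the same graph, so a
    producer that wants large batches should emit rows grouped by graph -
    the G-last column order does not prevent that, it just means the graph
    is read last from each row.
    """
    for graph, group in groupby(quads, key=lambda q: q[3]):
        yield graph, list(group)
```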


On 19/06/13 00:56, Rob Vesse wrote:
> The format already allows arbitrarily sized tuples (well in the current
> form it is capped at 255 columns per tuple) though it assumes that this
> will be used to convey SPARQL results and thus currently requires that
> column headers be provided.  Both those restrictions would be fairly easy
> to remove.
>
> I will raise the issue of open sourcing with management again and see if I
> get any traction.
>
> On the subject of column ordering I can see benefits of putting the <g>
> field first in that it may make it easier to batch operations on a single
> graph though I don't think putting it at the end to align with NQuads
> precludes this you just require slightly more lookahead to determine
> whether to continue adding statements to your batch.
>
> Rob
>
>
>
> On 6/18/13 4:41 PM, "Stephen Allen" <sa...@apache.org> wrote:
>
>> On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne <an...@apache.org> wrote:
>>
>>> On 18/06/13 22:13, Rob Vesse wrote:
>>>
>>>> Hey Andy
>>>>
>>>
>>> Hi Rob - thanks for the comments - really appreciate feedback -
>>>
>>>
>>>
>>>> The basic approach looks sound and I like the simple text based format,
>>>> see my notes later on about maybe having a binary serialization as
>>>> well.
>>>>
>>>
>>> A binary form would be excellent for this and for NT and NQ.  One of the
>>> speed limitations is parsing and Turtle is slower than NT (this isn't
>>> just
>>> a Jena effect).  gzip is neutral for reading but slows down writing.
>>> So a
>>> fast file format would be quite useful to add to the tool box.
>>>
>>>
>>>   How do you envisage incremental backups being implemented in practice,
>>> you
>>>> suggest in the document that you would take a full RDF dump and then
>>>> compute the RDF delta from a previous backup.  Talking from the
>>>> experience
>>>> of having done this as part of one of my experiments in my PhD this
>>>> can be
>>>> very complex and time consuming to do especially if you need to take
>>>> care
>>>> of BNode isomorphism.  I assume from some of the other discussion on
>>>> BNodes that you assume that IDs will remain stable across dumps, thus
>>>> there is an implicit requirement here that the database be able to dump
>>>> RDF using consistent BNode IDs (either internal IDs or some stable
>>>> round
>>>> trippable IDs).  Taking ARQ as an example the existing NQuads/TriG
>>>> writers
>>>> do not do this so there would need to be an option for those writers
>>>> to be
>>>> able to support this.
>>>>
>>>
>>> Shh, don't tell anyone but n-quads and n-triples outputs do dump
>>> recoverable bNode labels :-)  TriG and Turtle do not - they try to be
>>> pretty.  The readers need a  tweak to recover them but the label->Node
>>> code
>>> has an option for various label policies and recover id from label is
>>> one
>>> of them.  This is not exposed formally - it's strictly illegal for RDF
>>> syntaxes.  Or use <_:label> URIs.
>>>
>>> I have prototyped a wrapper dataset that records changes as they happen
>>> driven off add(quad) and delete(quad).  This produces the RDF Delta
>>> (sp!)
>>> form so couple to xtn and you can have a "live incremental backup".
>>>
>>> A strict after-the-event delta would be prohibitively expensive.
>>>
>>>
>>>   Even without any concerns of BNode isomorphism comparing two RDF dumps
>>> to
>>>> create a delta could be a potentially very time consuming operation and
>>>> recording the deltas as changes happen may be far more efficient.  Of
>>>> course depending on the exact use case the RDF dump and compute delta
>>>> approach may be acceptable.
>>>>
>>>
>>> It isn't a delta in the set theory A\B sense - nor is it a diff (it's
>>> not
>>> reversible without the additional condition).  "delta" and "diff" are
>>> both
>>> names I've toyed with - "RDF changes" might better capture the idea.  Or
>>> "RDF Changes Log".
>>>
>>>
>>>   My main criticism is on the "Minimise actions" section, there needs to
>>> be
>>>> a more solid clarification of definitions and when minimization can and
>>>> should happen.
>>>>
>>>
>>> Yes - it isn't as well covered in the doc.
>>>
>>> Logically - or generally - in the event-generating dataset wrapper:
>>>
>>>          if ( contains(g,s,p,o) ) {
>>>              record(QuadAction.NO_ADD,g,s,p,o) ; // No action.
>>>              return ;
>>>          }
>>>
>>>          add(g,s,p,o) ;
>>>          record(QuadAction.ADD,g,s,p,o) ;        // Action.
>>>
>>> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder
>>>
>>> but implementations like TDB can do it without the contains() as the
>>> indexes already return true/false for whether a change occurred or not.
>>>
>>>
>>>
>>>> For example:
>>>>
>>>> "When written in minimise form the RDF Delta can be run backwards, to
>>>> undo
>>>> a change. This only works when real changes are recorded because
>>>> otherwise
>>>> knowing a triple is added does not mean it was not there before."
>>>>
>>>> While I agree it is necessary to record real changes for deltas to be
>>>> reverse applied I'm not convinced they have to be in minimized form (at
>>>> least based on how the definition of minimized form reads right now),
>>>> if
>>>> only real changes are recorded then deltas will be in a minimal form.
>>>>
>>>> Yet it is not entirely clear by your definition the following delta
>>>> would
>>>> be considered minimal:
>>>>
>>>> A <http://s> <http://p> <http://o>
>>>> R <http://s> <http://p> <http://o>
>>>> A <http://s> <http://p> <http://o>
>>>>
>>>
>>> If the dataset did not originally contain <http://s> <http://p>
>>> <http://o>
>>> then that is minimal.  Each row makes a real change; it's the fact that
>>> graphs/datasets are sets of triples/quads that the real change is needed.
>>>
>>>
>>>   I'm assuming that your intention was that such deltas should not be
>>>> minimized but perhaps this needs to be more clear in the document.
>>>>
>>>
>>> There is no reason not to allow the redundant first two A-D to be
>>> removed
>>> but it's not required.
>>>
>>>
>>>   On the topic of related work:
>>>>
>>>> I think I may have mentioned previously that I've done some research
>>>> work
>>>> internally here at YarcData on a general purpose binary serialization
>>>> for
>>>> Triples, Quads and Tuples which likely could be fairly trivially
>>>> extended
>>>> to carry a binary encoding of the deltas as well which may save space.
>>>> For ball park comparison purposes compression is roughly equivalent to
>>>> GZipping raw NTriples with the key advantage being that the format is
>>>> significantly faster to process even in its current prototype single
>>>> threaded implementation (the design was written to take advantage of
>>>> parallelism).  There are a bunch of further optimizations that I had
>>>> ideas
>>>> for that I never got as far as implementing because of lack of
>>>> management
>>>> support for the concept.
>>>>
>>>
>>> My experience is that the cost of writing gzip is an appreciable
>>> slowdown.
>>> If your binary form removes that cost it would help full backups quite a lot.
>>>
>>>
>>>   There has been some discussion of open sourcing this work (likely as a
>>>> contributed Experimental module to Jena) so that it could be developed
>>>> outside of the company, if this sounds like it may be of interest I
>>>> will
>>>> broach the subject with relevant management again and see whether this
>>>> can
>>>> happen in the near future.
>>>>
>>>
>>> Please do.  I find the style of having a text form and a binary form
>>> makes
>>> system building easier.  Text files to debug; binary for production.
>>>
>>> We can add e.g. .ntz and .nqz to the known formats -- modules can add
>>> language, syntaxes, parsers and writers.  The JSON-LD module does, so I
>>> know it does work from outside; all the built-in ones actually register
>>> themselves the same way and have no specials.
>>>
>>>
>> Rob:
>>
>> I would definitely be interested in a binary format for both triples and
>> quads.  In fact, if it could be generalized to handle arbitrarily sized
>> RDF
>> tuples, that would be even better.  I would like to replace the current
>> text-based solution used for the spill-to-disk functionality.
>>
>> Andy:
>> I like what you've done and think it could be very useful.  One
>> suggestion:
>> the order of the tuples should be <s> <p> <o> <g> to match the N-Quads
>> format [1].
>>
>>
>> -Stephen
>>
>> [1] http://www.w3.org/TR/n-quads/
>


Re: RDF Delta - recording changes to RDF Datasets

Posted by Rob Vesse <rv...@yarcdata.com>.
The format already allows arbitrarily sized tuples (well in the current
form it is capped at 255 columns per tuple) though it assumes that this
will be used to convey SPARQL results and thus currently requires that
column headers be provided.  Both those restrictions would be fairly easy
to remove.

I will raise the issue of open sourcing with management again and see if I
get any traction.

On the subject of column ordering, I can see benefits of putting the <g>
field first in that it may make it easier to batch operations on a single
graph, though I don't think putting it at the end to align with NQuads
precludes this; you just require slightly more lookahead to determine
whether to continue adding statements to your batch.
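For what it's worth, the lookahead in question is tiny: each incoming row's graph term is compared to the current batch's. A sketch in Java (illustrative names, not any existing Jena API), batching consecutive quads per graph with the graph term in the last column:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: batch consecutive quads per graph even with the graph term last.
// The "lookahead" is just comparing each row's graph to the current batch's.
public class GraphBatcher {
    /** Groups consecutive s/p/o/g rows into per-graph batches. */
    public static List<List<String[]>> batch(List<String[]> quads) {
        List<List<String[]>> batches = new ArrayList<>();
        List<String[]> current = null;
        String graph = null;
        for (String[] q : quads) {
            String g = q[3];                  // graph is the last column
            if (current == null || !g.equals(graph)) {
                current = new ArrayList<>();  // graph changed: start a new batch
                batches.add(current);
                graph = g;
            }
            current.add(q);
        }
        return batches;
    }
}
```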

Rob



On 6/18/13 4:41 PM, "Stephen Allen" <sa...@apache.org> wrote:

>On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne <an...@apache.org> wrote:
>
>> On 18/06/13 22:13, Rob Vesse wrote:
>>
>>> Hey Andy
>>>
>>
>> Hi Rob - thanks for the comments - really appreciate feedback -
>>
>>
>>
>>> The basic approach looks sound and I like the simple text based format,
>>> see my notes later on about maybe having a binary serialization as
>>>well.
>>>
>>
>> A binary form would be excellent for this and for NT and NQ.  One of the
>> speed limitations is parsing and Turtle is slower than NT (this isn't
>>just
>> a Jena effect).  gzip is neutral for reading but slows down writing.
>>So a
>> fast file format would be quite useful to add to the tool box.
>>
>>
>>  How do you envisage incremental backups being implemented in practice,
>>you
>>> suggest in the document that you would take a full RDF dump and then
>>> compute the RDF delta from a previous backup.  Talking from the
>>>experience
>>> of having done this as part of one of my experiments in my PhD this
>>>can be
>>> very complex and time consuming to do especially if you need to take
>>>care
>>> of BNode isomorphism.  I assume from some of the other discussion on
>>> BNodes that you assume that IDs will remain stable across dumps, thus
>>> there is an implicit requirement here that the database be able to dump
>>> RDF using consistent BNode IDs (either internal IDs or some stable
>>>round
>>> trippable IDs).  Taking ARQ as an example the existing NQuads/TriG
>>>writers
>>> do not do this so there would need to be an option for those writers
>>>to be
>>> able to support this.
>>>
>>
>> Shh, don't tell anyone but n-quads and n-triples outputs do dump
>> recoverable bNode labels :-)  TriG and Turtle do not - they try to be
>> pretty.  The readers need a  tweak to recover them but the label->Node
>>code
>> has an option for various label policies and recover id from label is
>>one
>> of them.  This is not exposed formally - it's strictly illegal for RDF
>> syntaxes.  Or use <_:label> URIs.
>>
>> I have prototyped a wrapper dataset that records changes as they happen
>> driven off add(quad) and delete(quad).  This produces the RDF Delta
>>(sp!)
>> form so couple to xtn and you can have a "live incremental backup".
>>
>> A strict after-the-event delta would be prohibitively expensive.
>>
>>
>>  Even without any concerns of BNode isomorphism comparing two RDF dumps
>>to
>>> create a delta could be a potentially very time consuming operation and
>>> recording the deltas as changes happen may be far more efficient.  Of
>>> course depending on the exact use case the RDF dump and compute delta
>>> approach may be acceptable.
>>>
>>
>> It isn't a delta in the set theory A\B sense - nor is it a diff (it's
>>not
>> reversible without the additional condition).  "delta" and "diff" are
>>both
>> names I've toyed with - "RDF changes" might better capture the idea.  Or
>> "RDF Changes Log".
>>
>>
>>  My main criticism is on the "Minimise actions" section, there needs to
>>be
>>> a more solid clarification of definitions and when minimization can and
>>> should happen.
>>>
>>
>> Yes - it isn't as well covered in the doc.
>>
>> Logically - or generally - in the event-generating dataset wrapper:
>>
>>         if ( contains(g,s,p,o) ) {
>>             record(QuadAction.NO_ADD,g,s,p,o) ; // No action.
>>             return ;
>>         }
>>
>>         add(g,s,p,o) ;
>>         record(QuadAction.ADD,g,s,p,o) ;        // Action.
>>
>> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder
>>
>> but implementations like TDB can do it without the contains() as the
>> indexes already return true/false for whether a change occurred or not.
>>
>>
>>
>>> For example:
>>>
>>> "When written in minimise form the RDF Delta can be run backwards, to
>>>undo
>>> a change. This only works when real changes are recorded because
>>>otherwise
>>> knowing a triple is added does not mean it was not there before."
>>>
>>> While I agree it is necessary to record real changes for deltas to be
>>> reverse applied I'm not convinced they have to be in minimized form (at
>>> least based on how the definition of minimized form reads right now),
>>>if
>>> only real changes are recorded then deltas will be in a minimal form.
>>>
>>> Yet it is not entirely clear by your definition the following delta
>>>would
>>> be considered minimal:
>>>
>>> A <http://s> <http://p> <http://o>
>>> R <http://s> <http://p> <http://o>
>>> A <http://s> <http://p> <http://o>
>>>
>>
>> If the dataset did not originally contain <http://s> <http://p>
>><http://o>
>> then that is minimal.  Each row makes a real change; it's the fact that
>> graphs/datasets are sets of triples/quads that means the real change is needed.
>>
>>
>>  I'm assuming that your intention was that such deltas should not be
>>> minimized but perhaps this needs to be more clear in the document.
>>>
>>
>> There is no reason not to allow the redundant first two A/R rows to be
>>removed
>> but it's not required.
>>
>>
>>  On the topic of related work:
>>>
>>> I think I may have mentioned previously that I've done some research
>>>work
>>> internally here at YarcData on a general purpose binary serialization
>>>for
>>> Triples, Quads and Tuples which likely could be fairly trivially
>>>extended
>>> to carry a binary encoding of the deltas as well which may save space.
>>> For ball park comparison purposes compression is roughly equivalent to
>>> GZipping raw NTriples with the key advantage being that the format is
>>> significantly faster to process even in its current prototype single
>>> threaded implementation (the design was written to take advantage of
>>> parallelism).  There are a bunch of further optimizations that I had
>>>ideas
>>> for that I never got as far as implementing because of lack of
>>>management
>>> support for the concept.
>>>
>>
>> My experience is that the cost of writing gzip is an appreciable
>>slowdown.
>> If your binary form removes that cost it would help full backups quite a lot.
>>
>>
>>  There has been some discussion of open sourcing this work (likely as a
>>> contributed Experimental module to Jena) so that it could be developed
>>> outside of the company, if this sounds like it may be of interest I
>>>will
>>> broach the subject with relevant management again and see whether this
>>>can
>>> happen in the near future.
>>>
>>
>> Please do.  I find the style of having a text form and a binary form
>>makes
>> system building easier.  Text files to debug; binary for production.
>>
>> We can add e.g. .ntz and .nqz to the known formats -- modules can add
>> language, syntaxes, parsers and writers.  The JSON-LD module does, so I
>> know it does work from outside; all the built-in ones actually register
>> themselves the same way and have no specials.
>>
>>
>Rob:
>
>I would definitely be interested in a binary format for both triples and
>quads.  In fact, if it could be generalized to handle arbitrarily sized
>RDF
>tuples, that would be even better.  I would like to replace the current
>text-based solution used for the spill-to-disk functionality.
>
>Andy:
>I like what you've done and think it could be very useful.  One
>suggestion:
>the order of the tuples should be <s> <p> <o> <g> to match the N-Quads
>format [1].
>
>
>-Stephen
>
>[1] http://www.w3.org/TR/n-quads/


Re: RDF Delta - recording changes to RDF Datasets

Posted by Stephen Allen <sa...@apache.org>.
On Tue, Jun 18, 2013 at 6:05 PM, Andy Seaborne <an...@apache.org> wrote:

> On 18/06/13 22:13, Rob Vesse wrote:
>
>> Hey Andy
>>
>
> Hi Rob - thanks for the comments - really appreciate feedback -
>
>
>
>> The basic approach looks sound and I like the simple text based format,
>> see my notes later on about maybe having a binary serialization as well.
>>
>
> A binary form would be excellent for this and for NT and NQ.  One of the
> speed limitations is parsing and Turtle is slower than NT (this isn't just
> a Jena effect).  gzip is neutral for reading but slows down writing.  So a
> fast file format would be quite useful to add to the tool box.
>
>
>  How do you envisage incremental backups being implemented in practice, you
>> suggest in the document that you would take a full RDF dump and then
>> compute the RDF delta from a previous backup.  Talking from the experience
>> of having done this as part of one of my experiments in my PhD this can be
>> very complex and time consuming to do especially if you need to take care
>> of BNode isomorphism.  I assume from some of the other discussion on
>> BNodes that you assume that IDs will remain stable across dumps, thus
>> there is an implicit requirement here that the database be able to dump
>> RDF using consistent BNode IDs (either internal IDs or some stable round
>> trippable IDs).  Taking ARQ as an example the existing NQuads/TriG writers
>> do not do this so there would need to be an option for those writers to be
>> able to support this.
>>
>
> Shh, don't tell anyone but n-quads and n-triples outputs do dump
> recoverable bNode labels :-)  TriG and Turtle do not - they try to be
> pretty.  The readers need a  tweak to recover them but the label->Node code
> has an option for various label policies and recover id from label is one
> of them.  This is not exposed formally - it's strictly illegal for RDF
> syntaxes.  Or use <_:label> URIs.
>
> I have prototyped a wrapper dataset that records changes as they happen
> driven off add(quad) and delete(quad).  This produces the RDF Delta (sp!)
> form so couple to xtn and you can have a "live incremental backup".
>
> A strict after-the-event delta would be prohibitively expensive.
>
>
>  Even without any concerns of BNode isomorphism comparing two RDF dumps to
>> create a delta could be a potentially very time consuming operation and
>> recording the deltas as changes happen may be far more efficient.  Of
>> course depending on the exact use case the RDF dump and compute delta
>> approach may be acceptable.
>>
>
> It isn't a delta in the set theory A\B sense - nor is it a diff (it's not
> reversible without the additional condition).  "delta" and "diff" are both
> names I've toyed with - "RDF changes" might better capture the idea.  Or
> "RDF Changes Log".
>
>
>  My main criticism is on the "Minimise actions" section, there needs to be
>> a more solid clarification of definitions and when minimization can and
>> should happen.
>>
>
> Yes - it isn't as well covered in the doc.
>
> Logically - or generally - in the event-generating dataset wrapper:
>
>         if ( contains(g,s,p,o) ) {
>             record(QuadAction.NO_ADD,g,s,p,o) ; // No action.
>             return ;
>         }
>
>         add(g,s,p,o) ;
>         record(QuadAction.ADD,g,s,p,o) ;        // Action.
>
> https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder
>
> but implementations like TDB can do it without the contains() as the
> indexes already return true/false for whether a change occurred or not.
>
>
>
>> For example:
>>
>> "When written in minimise form the RDF Delta can be run backwards, to undo
>> a change. This only works when real changes are recorded because otherwise
>> knowing a triple is added does not mean it was not there before."
>>
>> While I agree it is necessary to record real changes for deltas to be
>> reverse applied I'm not convinced they have to be in minimized form (at
>> least based on how the definition of minimized form reads right now), if
>> only real changes are recorded then deltas will be in a minimal form.
>>
>> Yet it is not entirely clear by your definition the following delta would
>> be considered minimal:
>>
>> A <http://s> <http://p> <http://o>
>> R <http://s> <http://p> <http://o>
>> A <http://s> <http://p> <http://o>
>>
>
> If the dataset did not originally contain <http://s> <http://p> <http://o>
> then that is minimal.  Each row makes a real change; it's the fact that
> graphs/datasets are sets of triples/quads that means the real change is needed.
>
>
>  I'm assuming that your intention was that such deltas should not be
>> minimized but perhaps this needs to be more clear in the document.
>>
>
> There is no reason not to allow the redundant first two A/R rows to be removed
> but it's not required.
>
>
>  On the topic of related work:
>>
>> I think I may have mentioned previously that I've done some research work
>> internally here at YarcData on a general purpose binary serialization for
>> Triples, Quads and Tuples which likely could be fairly trivially extended
>> to carry a binary encoding of the deltas as well which may save space.
>> For ball park comparison purposes compression is roughly equivalent to
>> GZipping raw NTriples with the key advantage being that the format is
>> significantly faster to process even in its current prototype single
>> threaded implementation (the design was written to take advantage of
>> parallelism).  There are a bunch of further optimizations that I had ideas
>> for that I never got as far as implementing because of lack of management
>> support for the concept.
>>
>
> My experience is that the cost of writing gzip is an appreciable slowdown.
> If your binary form removes that cost it would help full backups quite a lot.
>
>
>  There has been some discussion of open sourcing this work (likely as a
>> contributed Experimental module to Jena) so that it could be developed
>> outside of the company, if this sounds like it may be of interest I will
>> broach the subject with relevant management again and see whether this can
>> happen in the near future.
>>
>
> Please do.  I find the style of having a text form and a binary form makes
> system building easier.  Text files to debug; binary for production.
>
> We can add e.g. .ntz and .nqz to the known formats -- modules can add
> language, syntaxes, parsers and writers.  The JSON-LD module does, so I
> know it does work from outside; all the built-in ones actually register
> themselves the same way and have no specials.
>
>
Rob:

I would definitely be interested in a binary format for both triples and
quads.  In fact, if it could be generalized to handle arbitrarily sized RDF
tuples, that would be even better.  I would like to replace the current
text-based solution used for the spill-to-disk functionality.

Andy:
I like what you've done and think it could be very useful.  One suggestion:
the order of the tuples should be <s> <p> <o> <g> to match the N-Quads
format [1].


-Stephen

[1] http://www.w3.org/TR/n-quads/

Re: RDF Delta - recording changes to RDF Datasets

Posted by Andy Seaborne <an...@apache.org>.
On 18/06/13 22:13, Rob Vesse wrote:
> Hey Andy

Hi Rob - thanks for the comments - really appreciate feedback -

>
> The basic approach looks sound and I like the simple text based format,
> see my notes later on about maybe having a binary serialization as well.

A binary form would be excellent for this and for NT and NQ.  One of the 
speed limitations is parsing and Turtle is slower than NT (this isn't 
just a Jena effect).  gzip is neutral for reading but slows down 
writing.  So a fast file format would be quite useful to add to the tool 
box.

> How do you envisage incremental backups being implemented in practice, you
> suggest in the document that you would take a full RDF dump and then
> compute the RDF delta from a previous backup.  Talking from the experience
> of having done this as part of one of my experiments in my PhD this can be
> very complex and time consuming to do especially if you need to take care
> of BNode isomorphism.  I assume from some of the other discussion on
> BNodes that you assume that IDs will remain stable across dumps, thus
> there is an implicit requirement here that the database be able to dump
> RDF using consistent BNode IDs (either internal IDs or some stable round
> trippable IDs).  Taking ARQ as an example the existing NQuads/TriG writers
> do not do this so there would need to be an option for those writers to be
> able to support this.

Shh, don't tell anyone but n-quads and n-triples outputs do dump 
recoverable bNode labels :-)  TriG and Turtle do not - they try to be 
pretty.  The readers need a  tweak to recover them but the label->Node 
code has an option for various label policies and recover id from label 
is one of them.  This is not exposed formally - it's strictly illegal 
for RDF syntaxes.  Or use <_:label> URIs.

I have prototyped a wrapper dataset that records changes as they happen 
driven off add(quad) and delete(quad).  This produces the RDF Delta 
(sp!) form so couple to xtn and you can have a "live incremental backup".

A strict after-the-event delta would be prohibitively expensive.

> Even without any concerns of BNode isomorphism comparing two RDF dumps to
> create a delta could be a potentially very time consuming operation and
> recording the deltas as changes happen may be far more efficient.  Of
> course depending on the exact use case the RDF dump and compute delta
> approach may be acceptable.

It isn't a delta in the set theory A\B sense - nor is it a diff (it's 
not reversible without the additional condition).  "delta" and "diff" 
are both names I've toyed with - "RDF changes" might better capture the 
idea.  Or "RDF Changes Log".

> My main criticism is on the "Minimise actions" section, there needs to be
> a more solid clarification of definitions and when minimization can and
> should happen.

Yes - it isn't as well covered in the doc.

Logically - or generally - in the event-generating dataset wrapper:

         if ( contains(g,s,p,o) ) {
             record(QuadAction.NO_ADD,g,s,p,o) ; // No action.
             return ;
         }

         add(g,s,p,o) ;
         record(QuadAction.ADD,g,s,p,o) ;        // Action.

https://github.com/afs/AFS-Dev/tree/master/src/main/java/projects/recorder

but implementations like TDB can do it without the contains() as the 
indexes already return true/false for whether a change occurred or not.
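Fleshed out, the wrapper pattern could look roughly like the sketch below. The class, the QuadAction enum and the textual log format are illustrative stand-ins, not the actual AFS-Dev or Jena API; a real implementation would wrap a DatasetGraph rather than hold a Set.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Minimal sketch of a change-recording dataset wrapper: an action is
// logged as "real" only when the underlying store actually changed.
public class RecordingDataset {
    public enum QuadAction { ADD, NO_ADD, DELETE, NO_DELETE }

    public record Quad(String g, String s, String p, String o) {}

    private final Set<Quad> quads = new HashSet<>();
    private final List<String> log = new ArrayList<>();

    public void add(Quad q) {
        if (quads.contains(q)) {           // already present: no real change
            record(QuadAction.NO_ADD, q);
            return;
        }
        quads.add(q);
        record(QuadAction.ADD, q);         // real change
    }

    public void delete(Quad q) {
        if (!quads.contains(q)) {          // absent: no real change
            record(QuadAction.NO_DELETE, q);
            return;
        }
        quads.remove(q);
        record(QuadAction.DELETE, q);      // real change
    }

    private void record(QuadAction a, Quad q) {
        // Illustrative log line, in the s p o g term order.
        log.add(a + " " + q.s() + " " + q.p() + " " + q.o() + " " + q.g());
    }

    public List<String> log() { return log; }
}
```

As noted, a store like TDB gets the real/no-op distinction for free from its indexes, so the extra contains() probe disappears there.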

>
> For example:
>
> "When written in minimise form the RDF Delta can be run backwards, to undo
> a change. This only works when real changes are recorded because otherwise
> knowing a triple is added does not mean it was not there before."
>
> While I agree it is necessary to record real changes for deltas to be
> reverse applied I'm not convinced they have to be in minimized form (at
> least based on how the definition of minimized form reads right now), if
> only real changes are recorded then deltas will be in a minimal form.
>
> Yet it is not entirely clear by your definition the following delta would
> be considered minimal:
>
> A <http://s> <http://p> <http://o>
> R <http://s> <http://p> <http://o>
> A <http://s> <http://p> <http://o>

If the dataset did not originally contain <http://s> <http://p> 
<http://o> then that is minimal.  Each row makes a real change; it's 
the fact that graphs/datasets are sets of triples/quads that means the 
real change is needed.

> I'm assuming that your intention was that such deltas should not be
> minimized but perhaps this needs to be more clear in the document.

There is no reason not to allow the redundant first two A/R rows to be 
removed but it's not required.
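For illustration, undoing a patch of real changes is just a matter of reversing the line order and swapping A and R. A rough Java sketch; the line format assumed here is the simple "A/R <s> <p> <o>" form from the examples in the thread, not a full parser, and the inverse is only sound when the patch recorded real changes:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch: invert a patch by swapping A <-> R and reversing the order.
public class PatchReverser {
    public static List<String> reverse(List<String> patch) {
        List<String> out = new ArrayList<>();
        for (String line : patch) {
            if (line.startsWith("A ")) {
                out.add("R " + line.substring(2));   // undo an add by removing
            } else if (line.startsWith("R ")) {
                out.add("A " + line.substring(2));   // undo a remove by adding
            } else {
                out.add(line);  // pass any other directives through unchanged
            }
        }
        Collections.reverse(out);  // apply the inverse actions last-to-first
        return out;
    }
}
```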

> On the topic of related work:
>
> I think I may have mentioned previously that I've done some research work
> internally here at YarcData on a general purpose binary serialization for
> Triples, Quads and Tuples which likely could be fairly trivially extended
> to carry a binary encoding of the deltas as well which may save space.
> For ball park comparison purposes compression is roughly equivalent to
> GZipping raw NTriples with the key advantage being that the format is
> significantly faster to process even in its current prototype single
> threaded implementation (the design was written to take advantage of
> parallelism).  There are a bunch of further optimizations that I had ideas
> for that I never got as far as implementing because of lack of management
> support for the concept.

My experience is that the cost of writing gzip is an appreciable 
slowdown.  If your binary form removes that cost it would help full 
backups quite a lot.

> There has been some discussion of open sourcing this work (likely as a
> contributed Experimental module to Jena) so that it could be developed
> outside of the company, if this sounds like it may be of interest I will
> broach the subject with relevant management again and see whether this can
> happen in the near future.

Please do.  I find the style of having a text form and a binary form 
makes system building easier.  Text files to debug; binary for production.

We can add e.g. .ntz and .nqz to the known formats -- modules can add 
language, syntaxes, parsers and writers.  The JSON-LD module does, so I 
know it does work from outside; all the built-in ones actually register 
themselves the same way and have no specials.

>
> Rob

	Andy


>> https://cwiki.apache.org/confluence/display/JENA/RDF+Delta


Re: RDF Delta - recording changes to RDF Datasets

Posted by Rob Vesse <rv...@yarcdata.com>.
Hey Andy

The basic approach looks sound and I like the simple text based format,
see my notes later on about maybe having a binary serialization as well.

How do you envisage incremental backups being implemented in practice, you
suggest in the document that you would take a full RDF dump and then
compute the RDF delta from a previous backup.  Talking from the experience
of having done this as part of one of my experiments in my PhD this can be
very complex and time consuming to do especially if you need to take care
of BNode isomorphism.  I assume from some of the other discussion on
BNodes that you assume that IDs will remain stable across dumps, thus
there is an implicit requirement here that the database be able to dump
RDF using consistent BNode IDs (either internal IDs or some stable round
trippable IDs).  Taking ARQ as an example the existing NQuads/TriG writers
do not do this so there would need to be an option for those writers to be
able to support this.

Even without any concerns of BNode isomorphism comparing two RDF dumps to
create a delta could be a potentially very time consuming operation and
recording the deltas as changes happen may be far more efficient.  Of
course depending on the exact use case the RDF dump and compute delta
approach may be acceptable.

My main criticism is on the "Minimise actions" section: there needs to be
a more solid clarification of definitions and when minimization can and
should happen.

For example:

"When written in minimise form the RDF Delta can be run backwards, to undo
a change. This only works when real changes are recorded because otherwise
knowing a triple is added does not mean it was not there before."

While I agree it is necessary to record real changes for deltas to be
reverse applied I'm not convinced they have to be in minimized form (at
least based on how the definition of minimized form reads right now), if
only real changes are recorded then deltas will be in a minimal form.

Yet it is not entirely clear by your definition the following delta would
be considered minimal:

A <http://s> <http://p> <http://o>
R <http://s> <http://p> <http://o>
A <http://s> <http://p> <http://o>

I'm assuming that your intention was that such deltas should not be
minimized but perhaps this needs to be more clear in the document.

On the topic of related work:

I think I may have mentioned previously that I've done some research work
internally here at YarcData on a general purpose binary serialization for
Triples, Quads and Tuples which likely could be fairly trivially extended
to carry a binary encoding of the deltas as well which may save space.
For ball park comparison purposes compression is roughly equivalent to
GZipping raw NTriples with the key advantage being that the format is
significantly faster to process even in its current prototype single
threaded implementation (the design was written to take advantage of
parallelism).  There are a bunch of further optimizations that I had ideas
for that I never got as far as implementing because of lack of management
support for the concept.
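As a point of reference only (this is not the YarcData format being described, just a generic illustration of the idea), a length-prefixed binary encoding of quads might look like:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Illustrative length-prefixed binary encoding for quads (s, p, o, g as
// lexical strings).  A real format would intern repeated terms, pick a
// smarter length encoding, and carry type information.
public class QuadCodec {
    public static byte[] encode(String[][] quads) {
        try {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            DataOutputStream out = new DataOutputStream(bos);
            out.writeInt(quads.length);                  // quad count
            for (String[] q : quads) {
                for (String term : q) {                  // s, p, o, g order
                    byte[] b = term.getBytes(StandardCharsets.UTF_8);
                    out.writeShort(b.length);            // 2-byte length prefix
                    out.write(b);
                }
            }
            return bos.toByteArray();
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }

    public static String[][] decode(byte[] data) {
        try {
            DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
            String[][] quads = new String[in.readInt()][4];
            for (String[] q : quads) {
                for (int i = 0; i < 4; i++) {
                    byte[] b = new byte[in.readUnsignedShort()];
                    in.readFully(b);
                    q[i] = new String(b, StandardCharsets.UTF_8);
                }
            }
            return quads;
        } catch (IOException e) { throw new UncheckedIOException(e); }
    }
}
```

The win over gzipped N-Triples comes from skipping tokenisation entirely on read: the reader slices fixed-length byte runs instead of scanning for delimiters.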

There has been some discussion of open sourcing this work (likely as a
contributed Experimental module to Jena) so that it could be developed
outside of the company; if this sounds like it may be of interest I will
broach the subject with relevant management again and see whether this can
happen in the near future.

Rob


On 6/18/13 7:26 AM, "Andy Seaborne" <an...@apache.org> wrote:

>I started writing up a format for transferring changes between dataset
>copies (copies in time and in location).
>
>https://cwiki.apache.org/confluence/display/JENA/RDF+Delta
>
>Still rough and ready but I hope it gives a general impression of the
>format and usage.
>
>Comments, thoughts, discussion here on dev@ please.
>
>	Andy