Posted to dev@jena.apache.org by "ajs6f@virginia.edu" <aj...@virginia.edu> on 2015/07/23 15:18:15 UTC

Journaling DatasetGraph

After a longish conversation with Andy Seaborne, I've worked up a simple journaling DatasetGraph wrapping implementation. The idea is to use journaling to support proper aborting behavior (which I believe this code does) and to add to that a semantic for DatasetGraph::addGraph that copies tuples instead of leaving a reference to the added Graph (which I believe this code also does). Between these two behaviors, the idea is to be able to support transactionality (MRSW only) reasonably well.
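A minimal sketch of the journaling idea, not the code in the branch linked below. It assumes a single writer and uses hypothetical stand-ins (`Quad`, `JournalingStore`, and every method name) rather than Jena types. Each effective change made during a write is recorded, and abort replays the record newest-first, applying the opposite change.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical stand-in for a Jena quad: graph, subject, predicate, object.
record Quad(String g, String s, String p, String o) {}

// Hypothetical wrapper: journals effective changes so a write can be aborted.
class JournalingStore {
    private enum Action { ADD, DELETE }
    private record Entry(Action action, Quad quad) {}

    private final Set<Quad> quads = new HashSet<>();
    private final Deque<Entry> journal = new ArrayDeque<>();
    private boolean writing = false;

    void beginWrite() { writing = true; journal.clear(); }

    void add(Quad q) {
        requireWrite();
        // Journal only changes that actually altered the store.
        if (quads.add(q)) journal.push(new Entry(Action.ADD, q));
    }

    void delete(Quad q) {
        requireWrite();
        if (quads.remove(q)) journal.push(new Entry(Action.DELETE, q));
    }

    void commit() { journal.clear(); writing = false; }

    void abort() {
        // The deque is used as a stack, so entries pop newest-first.
        while (!journal.isEmpty()) {
            Entry e = journal.pop();
            if (e.action() == Action.ADD) quads.remove(e.quad());
            else quads.add(e.quad());
        }
        writing = false;
    }

    boolean contains(Quad q) { return quads.contains(q); }
    int size() { return quads.size(); }

    private void requireWrite() {
        if (!writing) throw new IllegalStateException("not in a write transaction");
    }
}
```

Recording only changes that took effect keeps abort exact: undoing an `add` of a quad that was already present would otherwise delete data that predates the transaction.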

The idea is (if this code looks like a reasonable direction) to move onwards to an implementation that uses persistent data structures for covering indexes in order to get at least to MR+SW and eventually to attack JENA-624: "Develop a new in-memory RDF Dataset implementation".

Feedback / advice / criticism greedily desired and welcome!

https://github.com/ajs6f/jena/tree/JournalingDatasetgraph

https://github.com/apache/jena/compare/master...ajs6f:JournalingDatasetgraph

---
A. Soroka
The University of Virginia Library

Re: Journaling DatasetGraph

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

Thanks for the feedback Andy.

> 2/
> Datasets that provide support for MW cases and don't provide transactions seem rather unlikely, so maybe document what kind of DatasetGraph is being supported by DatasetGraphWithRecord and then just use the underlying lock.

Okay, that's certainly simpler! And it keeps my grubby fingers out of Lock. {grin} 

> 3/
> There are two things to protect in DatasetGraphWithRecord: the underlying dataset and the transaction log, which supports abort for writers only.  They can have separate mechanisms.  Use the dataset lock for the DatasetGraph actions and make the transaction undo log operations safe by other means.

You mean an independent lock visible only inside DatasetGraphWithRecord?

> .. hmm ... the order of entries in the log may matter, so true parallel MW looks increasingly hard to deal with anyway.  Document and not worry for now?

My fear has been that MW means

a) a log per write-transaction and connections from the transaction to a particular set of states for the indexes
b) with those "forward" states invisible outside the transaction
c) and all the nightmare fun of merging states!

---
A. Soroka
The University of Virginia Library

On Aug 4, 2015, at 4:32 PM, Andy Seaborne <an...@apache.org> wrote:

> On 03/08/15 17:13, ajs6f@virginia.edu wrote:
>> I've made some emendations to (hopefully) fix this problem. In order to do so, I added a method to Lock itself to report the quality of an instance, simply as an enumeration. I had hoped to avoid touching any of the extant code, but because Lock is a public type that can be instantiated by anyone, I just can't see how to resolve this problem without some way for a Lock to categorize itself independently of the type system's inheritance.
>> 
>> Feedback welcome!
> 
> A few things occur to me:
> 
> 1/
> The transaction log is for supporting abort for writers only.  Nothing needs to be done in DatasetGraphWithRecord for readers. DatasetGraphWithLock does what's needed.  So you don't even need to startRecording for a READ (and the commit clear - _end always aborts is an interesting way to do it!).
> 
> 2/
> Datasets that provide support for MW cases and don't provide transactions seem rather unlikely, so maybe document what kind of DatasetGraph is being supported by DatasetGraphWithRecord and then just use the underlying lock.
> 
> It's not just a case of using ConcurrentHashMap, say, as likely there would be multiple of them for different indexes and that would give weird consistency issues as different parts get updated safely with respect to part of the datastructure but it will be visibly different depending on what the reader uses.  So I think MW will have additional coordination.
> 
> 3/
> 
> There are two things to protect in DatasetGraphWithRecord: the underlying dataset and the transaction log, which supports abort for writers only.  They can have separate mechanisms.  Use the dataset lock for the DatasetGraph actions and make the transaction undo log operations safe by other means.
> 
> .. hmm ... the order of entries in the log may matter, so true parallel MW looks increasingly hard to deal with anyway.  Document and not worry for now?
> 
> 	Andy
> 
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
>> On Jul 29, 2015, at 5:04 PM, Andy Seaborne <an...@apache.org> wrote:
>> 
>>> The lock provided by the underlying dataset may matter.  DatasetGraphs support critical sections.  DatasetGraphWithLock uses critical sections of the underlying dataset.
>>> 
>>> I gave a (hypothetical) example where the lock must be more restrictive than ReentrantReadWriteLock (LockMRSW is a ReentrantReadWriteLock + counting support to catch application errors).
>>> 
>>> DatasetGraphWithRecord is relying on single-W for its own datastructures.
>>> 
>>> 	Andy
>>> 
>>> On 29/07/15 21:22, ajs6f@virginia.edu wrote:
>>>> I'm not sure I understand this advice-- are you saying that because no DatasetGraph can be assumed to support MR, there isn't any point in trying to support MR at the level of DatasetGraphWithRecord? That would seem to make my whole effort a bit pointless.
>>>> 
>>>> Or are you saying that because, in practice, all DatasetGraphs _do_ support MR, there's no need to enforce it at the level of DatasetGraphWithRecord?
>>>> 
>>>> ---
>>>> A. Soroka
>>>> The University of Virginia Library
>>>> 
>>>> On Jul 29, 2015, at 4:14 PM, Andy Seaborne <an...@apache.org> wrote:
>>>> 
>>>>> On 27/07/15 18:06, ajs6f@virginia.edu wrote:
>>>>>>> Is there some specific reason as to why you override the DatasetGraphWithLock lock?
>>>>>> Yes, because DatasetGraphWithLock has no Lock that I could find, and it inherits getLock() from DatasetGraphTrackActive, which just pulls the lock from the wrapped DatasetGraph. I wanted to make sure that a MRSW Lock is in play. But maybe I am misunderstanding the interaction here? (No surprise! {grin})
>>>>>> 
>>>>> 
>>>>> A DatasetGraph provides whatever lock is suitable to meet the contract of concurrency [1]
>>>>> 
>>>>> Some implementations (there aren't any) may not even be able to support true parallel readers (for example, data structures that make internal changes even in read operations, like moving recently accessed items to the top or caching computation needed for read).
>>>>> 
>>>>> There aren't any (the rules are R-safe) - locks are always LockMRSW.
>>>>> 
>>>>> [1] http://jena.apache.org/documentation/notes/concurrency-howto.html
>>>>> 
>>>>> 	Andy
>>>>> 
>>>> 
>>> 
>> 
>

Re: Journaling DatasetGraph

Posted by Andy Seaborne <an...@apache.org>.

On 03/08/15 17:13, ajs6f@virginia.edu wrote:
> I've made some emendations to (hopefully) fix this problem. In order to do so, I added a method to Lock itself to report the quality of an instance, simply as an enumeration. I had hoped to avoid touching any of the extant code, but because Lock is a public type that can be instantiated by anyone, I just can't see how to resolve this problem without some way for a Lock to categorize itself independently of the type system's inheritance.
>
> Feedback welcome!

A few things occur to me:

1/
The transaction log is for supporting abort for writers only.  Nothing 
needs to be done in DatasetGraphWithRecord for readers. 
DatasetGraphWithLock does what's needed.  So you don't even need to 
startRecording for a READ (and the commit clear - _end always aborts is 
an interesting way to do it!).
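Point 1 can be sketched roughly as follows. This is an illustration under assumptions, not the DatasetGraphWithRecord code: `Mode`, `RecordingSet`, and the method names are hypothetical. Nothing is recorded for a READ, commit just forgets the undo actions, and `end` always "aborts", which is a no-op once the log is empty.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

// Hypothetical transaction modes; not Jena's own enum.
enum Mode { READ, WRITE }

class RecordingSet {
    private final Set<String> data = new HashSet<>();
    // Undo actions, newest first. Only ever filled during a WRITE.
    private final Deque<Runnable> undoLog = new ArrayDeque<>();
    private Mode mode = null;

    // Nothing to set up for READ: no recording is started.
    void begin(Mode m) { mode = m; }

    void add(String item) {
        if (mode != Mode.WRITE)
            throw new IllegalStateException("writes need a WRITE transaction");
        if (data.add(item)) undoLog.push(() -> data.remove(item));
    }

    boolean contains(String item) { return data.contains(item); }

    // Commit simply forgets the undo actions.
    void commit() { undoLog.clear(); }

    // End always "aborts": after a commit the log is empty, so nothing happens.
    void end() {
        while (!undoLog.isEmpty()) undoLog.pop().run();
        mode = null;
    }

    int pendingUndoEntries() { return undoLog.size(); }
}
```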

2/
Datasets that provide support for MW cases and don't provide 
transactions seem rather unlikely, so maybe document what kind of 
DatasetGraph is being supported by DatasetGraphWithRecord and then just 
use the underlying lock.

It's not just a case of using ConcurrentHashMap, say, as likely there 
would be multiple of them for different indexes and that would give 
weird consistency issues as different parts get updated safely with 
respect to part of the datastructure but it will be visibly different 
depending on what the reader uses.  So I think MW will have additional 
coordination.
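To make the consistency concern concrete, here is a small sketch with two hypothetical indexes (`bySubject`, `byObject`), each a `ConcurrentHashMap`. Every individual map operation is thread-safe, yet one logical add spans two updates. The `betweenUpdates` callback is an assumption of this example, standing in for a concurrent reader that happens to look in the gap.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical two-index store; each map is thread-safe on its own.
class TwoIndexStore {
    final ConcurrentHashMap<String, Set<String>> bySubject = new ConcurrentHashMap<>();
    final ConcurrentHashMap<String, Set<String>> byObject = new ConcurrentHashMap<>();

    // betweenUpdates stands in for whatever a concurrent reader might do
    // in the gap between the two index updates.
    void add(String subject, String object, Runnable betweenUpdates) {
        bySubject.computeIfAbsent(subject, k -> ConcurrentHashMap.newKeySet()).add(object);
        betweenUpdates.run();
        byObject.computeIfAbsent(object, k -> ConcurrentHashMap.newKeySet()).add(subject);
    }

    boolean subjectIndexHas(String s, String o) {
        return bySubject.getOrDefault(s, Set.of()).contains(o);
    }

    boolean objectIndexHas(String s, String o) {
        return byObject.getOrDefault(o, Set.of()).contains(s);
    }
}
```

A reader running in the gap finds the pair through the subject index but not through the object index, so the answer depends on which index the reader uses. Avoiding that needs coordination across the maps, not just safe maps.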

3/

There are two things to protect in DatasetGraphWithRecord: the 
underlying dataset and the transaction log, which supports abort for 
writers only.  They can have separate mechanisms.  Use the dataset lock 
for the DatasetGraph actions and make the transaction undo log 
operations safe by other means.
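One possible reading of "separate mechanisms", as a sketch only: the data is guarded by a read/write lock (a stand-in for whatever lock the underlying dataset supplies), while the undo log has its own private monitor. `GuardedStore` and its members are hypothetical, and the example assumes a single writer at a time.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.locks.ReentrantReadWriteLock;

class GuardedStore {
    // Stand-in for the lock the underlying dataset would supply.
    private final ReentrantReadWriteLock datasetLock = new ReentrantReadWriteLock();
    private final Set<String> data = new HashSet<>();

    // The undo log has its own monitor, independent of the dataset lock.
    private final Object logMonitor = new Object();
    private final Deque<String> addedItems = new ArrayDeque<>();

    void add(String item) {
        boolean added;
        datasetLock.writeLock().lock();
        try { added = data.add(item); }
        finally { datasetLock.writeLock().unlock(); }
        if (added) {
            synchronized (logMonitor) { addedItems.push(item); }
        }
    }

    boolean contains(String item) {
        datasetLock.readLock().lock();
        try { return data.contains(item); }
        finally { datasetLock.readLock().unlock(); }
    }

    void commit() {
        synchronized (logMonitor) { addedItems.clear(); }
    }

    void abort() {
        // Always dataset lock first, then log monitor, to keep lock order fixed.
        datasetLock.writeLock().lock();
        try {
            synchronized (logMonitor) {
                while (!addedItems.isEmpty()) data.remove(addedItems.pop());
            }
        } finally { datasetLock.writeLock().unlock(); }
    }
}
```

Because the log entry is written after the dataset lock is released, two concurrent writers could log in a different order than they applied their changes, which is why the sketch is limited to one writer.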

.. hmm ... the order of entries in the log may matter, so true parallel 
MW looks increasingly hard to deal with anyway.  Document and not worry 
for now?

	Andy


Re: Journaling DatasetGraph

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

I've made some emendations to (hopefully) fix this problem. In order to do so, I added a method to Lock itself to report the quality of an instance, simply as an enumeration. I had hoped to avoid touching any of the extant code, but because Lock is a public type that can be instantiated by anyone, I just can't see how to resolve this problem without some way for a Lock to categorize itself independently of the type system's inheritance.

Feedback welcome!

---
A. Soroka
The University of Virginia Library


Re: Journaling DatasetGraph

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

I think I understand the problem now. Assuming I do, I see two cases:

1) The underlying dataset has locking that is _more_ restrictive than MRSW, in which case DatasetGraphWithRecord must expose that locking, lest it break the underlying impl.

2) The underlying dataset has locking that is _less_ restrictive than MRSW, in which case DatasetGraphWithRecord must eclipse that locking, lest it break DatasetGraphWithRecord's impl.

So my task is to adopt some careful meaning for "more" and "less" as used above and use it to make DatasetGraphWithRecord's locking more intelligent. I do not see anything in Jena that would answer to the purpose, but maybe I am missing something. {fingers-crossed}
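One way to give "more" and "less" a careful meaning is a totally ordered enumeration of lock kinds. The sketch below is purely hypothetical: `LockKind`, its constants, and `LockChooser` are invented for illustration, nothing like them is claimed to exist in Jena, and a real ordering might not be total.

```java
// Hypothetical ordering of lock kinds, least restrictive first.
enum LockKind {
    MRMW,        // multiple readers and multiple writers at once
    MR_PLUS_SW,  // readers may overlap with the single writer
    MRSW,        // multiple readers or a single writer
    EXCLUSIVE;   // one thread at a time, reader or writer

    boolean atLeastAsRestrictiveAs(LockKind other) {
        // Relies on declaration order: later constants are more restrictive.
        return compareTo(other) >= 0;
    }
}

class LockChooser {
    // What the journaling wrapper needs for its own data structures.
    static final LockKind REQUIRED = LockKind.MRSW;

    // Case 1: the underlying lock is at least as strict, so expose it.
    // Case 2: the underlying lock is looser, so eclipse it with MRSW.
    static LockKind effective(LockKind underlying) {
        return underlying.atLeastAsRestrictiveAs(REQUIRED) ? underlying : REQUIRED;
    }
}
```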

---
A. Soroka
The University of Virginia Library


Re: Journaling DatasetGraph

Posted by Andy Seaborne <an...@apache.org>.

The lock provided by the underlying dataset may matter.  DatasetGraphs 
support critical sections.  DatasetGraphWithLock uses critical sections 
of the underlying dataset.

I gave a (hypothetical) example where the lock must be more restrictive 
than ReentrantReadWriteLock (LockMRSW is a ReentrantReadWriteLock + 
counting support to catch application errors).
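For readers unfamiliar with the pattern, a read/write lock with counting might look roughly like this. It is a hypothetical sketch and not Jena's LockMRSW: the class name, the counters, and the decision to reject nested sections outright are all assumptions made to keep the example short.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Hypothetical sketch; not Jena's LockMRSW.
class CountingMRSWLock {
    private final ReentrantReadWriteLock rw = new ReentrantReadWriteLock();
    private final AtomicInteger activeReaders = new AtomicInteger();
    private final AtomicInteger activeWriters = new AtomicInteger();
    // Remembers, per thread, which kind of section it is in (null = none).
    private final ThreadLocal<Boolean> inReadSection = new ThreadLocal<>();

    void enterCriticalSection(boolean read) {
        // Simplification: this sketch treats nesting as an application error.
        if (inReadSection.get() != null)
            throw new IllegalStateException("nested critical section on this thread");
        if (read) { rw.readLock().lock(); activeReaders.incrementAndGet(); }
        else      { rw.writeLock().lock(); activeWriters.incrementAndGet(); }
        inReadSection.set(read);
    }

    void leaveCriticalSection() {
        Boolean read = inReadSection.get();
        if (read == null)
            throw new IllegalStateException("leave without matching enter");
        inReadSection.remove();
        if (read) { activeReaders.decrementAndGet(); rw.readLock().unlock(); }
        else      { activeWriters.decrementAndGet(); rw.writeLock().unlock(); }
    }

    int readers() { return activeReaders.get(); }
    int writers() { return activeWriters.get(); }
}
```

The bookkeeping turns a silent misuse, such as leaving a section that was never entered, into an immediate exception.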

DatasetGraphWithRecord is relying on single-W for its own datastructures.

	Andy


Re: Journaling DatasetGraph

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

I'm not sure I understand this advice-- are you saying that because no DatasetGraph can be assumed to support MR, there isn't any point in trying to support MR at the level of DatasetGraphWithRecord? That would seem to make my whole effort a bit pointless.

Or are you saying that because, in practice, all DatasetGraphs _do_ support MR, there's no need to enforce it at the level of DatasetGraphWithRecord?

---
A. Soroka
The University of Virginia Library


Re: Journaling DatasetGraph

Posted by Andy Seaborne <an...@apache.org>.

On 27/07/15 18:06, ajs6f@virginia.edu wrote:
>> Is there some specific reason as to why you override the DatasetGraphWithLock lock?
> Yes, because DatasetGraphWithLock has no Lock that I could find, and it inherits getLock() from DatasetGraphTrackActive, which just pulls the lock from the wrapped DatasetGraph. I wanted to make sure that a MRSW Lock is in play. But maybe I am misunderstanding the interaction here? (No surprise! {grin})
>

A DatasetGraph provides whatever lock is suitable to meet the contract 
of concurrency [1]

Some implementations (there aren't any) may not even be able to support 
true parallel readers (for example, data structures that make internal 
changes even in read operations, like moving recently accessed items to 
the top or caching computation needed for read).
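A familiar JDK example of a structure whose reads mutate it is `java.util.LinkedHashMap` in access-order mode, where `get` relinks the entry it touches. The sketch below is only an analogy for the hypothetical case described above and has nothing to do with any Jena class; `AccessOrderDemo` is a name made up for the example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;

class AccessOrderDemo {
    // Returns the key order before and after a single get().
    static List<List<String>> orderBeforeAndAfterRead() {
        // accessOrder = true: iteration runs least- to most-recently accessed.
        LinkedHashMap<String, Integer> cache = new LinkedHashMap<>(16, 0.75f, true);
        cache.put("a", 1);
        cache.put("b", 2);
        cache.put("c", 3);
        List<String> before = new ArrayList<>(cache.keySet());
        cache.get("a");   // a "read" that relinks the entry internally
        List<String> after = new ArrayList<>(cache.keySet());
        return List.of(before, after);
    }
}
```

The order goes from a, b, c to b, c, a after one `get`. Two threads that only ever read such a map would still race on its internal links, so an MRSW lock would not be restrictive enough for it.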

There aren't any (the rules are R-safe) - locks are always LockMRSW.

[1] http://jena.apache.org/documentation/notes/concurrency-howto.html

	Andy

Re: Journaling DatasetGraph

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

Thanks for the feedback, Andy! See comment in-line below.

---
A. Soroka
The University of Virginia Library

On Jul 25, 2015, at 7:43 AM, Andy Seaborne <an...@apache.org> wrote:

> A first look - there's quite a lot to do with the release at the moment.

Right, I don't expect anyone to get around to much consideration of this until that is over. Good luck!

> Having a separate set of functionality to the underlying DatasetGraph is good for the MRSW case and with that composition on multiple datasets, text indexes etc etc. For the MR+SW, I think the more connected nature of transactions and implementation might make it harder to have independent functionality but we'll see.

I agree. That's why I did this as a "wrap-around". I don't think MR+SW _can_ be done that way, but we'll see…

> Yes - addGraph ought to be a copy.  The general dataset where the app can put together a collection of different graph types is the exception but needed for the case of some graphs being inference, maybe some not.

As I wrote, I believe that my current code does this solidly and the test shows it, but I'm not sure that the impl is as efficient as possible. Suggestions welcome!
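The difference between the two addGraph semantics can be shown with plain collections. This is a sketch under assumptions, not the implementation under discussion: a "graph" here is just a mutable set of triple strings, and `CopyingDataset` with its two methods is hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins: a "graph" is a mutable set of triple strings.
class CopyingDataset {
    private final Map<String, Set<String>> graphs = new HashMap<>();

    // Copying semantic: the dataset keeps its own copy of the triples.
    void addGraphByCopy(String name, Set<String> graph) {
        graphs.put(name, new HashSet<>(graph));
    }

    // Reference semantic: the dataset shares the caller's object.
    void addGraphByReference(String name, Set<String> graph) {
        graphs.put(name, graph);
    }

    boolean contains(String name, String triple) {
        Set<String> g = graphs.get(name);
        return g != null && g.contains(triple);
    }
}
```

With the reference semantic, a later change made directly to the caller's graph shows up inside the dataset without passing through it, so no journal could record or undo it. The copy closes that hole at the cost of copying every tuple.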

> One of the things that strikes me is that extending Quad to be a QuadOperation breaks being a Quad.  It adds functionality a quad does not have.  Two quads are equal if they have the same G/S/P/O and that's not true for QuadOperation.
> An operation is a pair - the action and the data - not data.

I'm not sure I understand the objection here: all classes inherit from Object and virtually all of them add functionality Object does not have and break its equality definition. I certainly understand the view on operations you're taking, but I'm proposing a different one that includes data, action (in my code, that comes in the form of type, not an enumeration, so that I can replace cases in your code with polymorphism) _and_ service type. Adding a quad to a special index might be substantially different than adding it to a dataset.

> e.g. Putting a QuadOperation into a DatasetGraph would cause problems.

Because of the equality question? I _think_ I understand this objection; are you saying that logic for things like DatasetGraph::contains becomes problematic? To my mind it implies a more sophisticated type of comparison (using equivalence and not equals()) instead of a different kind of data structure. I'll try to make some corrections to show what I mean and give you something to react to. I may be wrong here, but I'd like to follow out the idea.

> ListBackedOperationRecord<OpType> extends ReversibleOperationRecord<OpType>
> 
> public class ListBackedOperationRecord<OpType extends InvertibleOperation<?, ?, ?, ?>>
> 		implements ReversibleOperationRecord<OpType> {
> 
> while, yes, a collection of operations could be an operation datasets don't provide such composite operations so the abstraction is not used.  And the reverse of it would be recursive - each operation needs reversing.

I am _not_ making the claim here that "a collection of operations could be an operation". A record (in my code) is just a record. It is _not_ usable as an aggregate operation and doesn't subtype Operation. There is no use of records as operations nor any intended such use, so no problem. 

> I'd keep log (= list of operations) as a separate concept from the operations themselves.  One key operation of a ListBackedOperationRecord is clear and Operations are Or this is a naming thing, is "record" the log entry or the log itself?

Something seems to have been eaten out of your mail (!) but anyway, a record _is_ a separate concept from operation. There is ReversibleOperationRecord and there is Operation and the only relationship between them is that Operation is a parameter type for ReversibleOperationRecord::add and part of the parameter type for ReversibleOperationRecord::consume. As far as names, I'm not sure what you mean-- ReversibleOperationRecord the type? That's a log. It contains Operations, but _is not one itself_. 

> Is there some specific reason as to why you override the DatasetGraphWithLock lock?

Yes, because DatasetGraphWithLock has no Lock that I could find, and it inherits getLock() from DatasetGraphTrackActive, which just pulls the lock from the wrapped DatasetGraph. I wanted to make sure that a MRSW Lock is in play. But maybe I am misunderstanding the interaction here? (No surprise! {grin})

> One difference is that the notion of reversing an operation is not a feature of the operation itself, it's the way it is played back.  Partially, this is efficiency (which may not matter) as it reduces the object churn, but it also puts undo-playback in one place (e.g. reading and writing from storage, which might be non-heap memory, or a compacted form, or even a disk) for cases where large+long transactions, even in-memory, lead to excessive object use.  Just an idea.
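The playback-centred design described in the quote might be sketched like this. It is an assumption-laden illustration, not the transdsg code: `Action`, `LogEntry`, and `UndoPlayback` are hypothetical, and quads are plain strings. Log entries are inert data, and all knowledge of how to reverse them sits in one routine.

```java
import java.util.List;
import java.util.Set;

enum Action { ADD, DELETE }

// A log entry is plain data: an action paired with the quad (a string here).
record LogEntry(Action action, String quad) {}

class UndoPlayback {
    // All inversion logic lives here, not on the entries.
    static void undo(List<LogEntry> log, Set<String> store) {
        // Walk the log newest-first and apply the opposite action.
        for (int i = log.size() - 1; i >= 0; i--) {
            LogEntry e = log.get(i);
            switch (e.action()) {
                case ADD:    store.remove(e.quad()); break;
                case DELETE: store.add(e.quad());    break;
            }
        }
    }
}
```

Since entries are just an action tag plus data, they could be written to a compact or off-heap form without creating inverse objects.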

Yeah, I intentionally separated the two (reverse an operation, reverse a series of operations) because:

1) I think the idea of the inverse of an operation is important enough that I want it visible in the type system, hence the verbose but specific signature of InvertibleOperation. I believe that I have correctly written it so that you cannot have an InvertibleOperation without a specific inverse and that inverse must declare your InvertibleOperation as _its_ inverse.

2) Moving the reversal of a single operation into a method on Operation let me use polymorphism, as mentioned above, to make the code in DatasetGraphWithRecord clear, concise, and extensible. You can have more kinds of operations that can occur inside a transaction without changing the idea of an abort: you would subclass DatasetGraphWithRecord and override _add and _delete. I should have documented that decision more carefully.

3) It is possible to imagine different ways to use journals to support transactions (for example, in the MR+SW case) and I wanted to isolate the two kinds of logic for rerunning a journal (ReversibleOperationRecord::reverse and ::consume) and the operations you do while rerunning it (found in DatasetGraphWithRecord or another type). I'm thinking of it a bit like internal vs. external iteration.
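The constraint in point 1 can be expressed with mutually recursive type bounds. The sketch below is a simplified, hypothetical two-parameter version, not the four-parameter InvertibleOperation from the branch: `Invertible`, `AddQuad`, and `DeleteQuad` are invented names and quads are plain strings.

```java
import java.util.Set;

// Each operation type names its inverse type, and that inverse type
// must name the operation type back.
interface Invertible<SELF extends Invertible<SELF, INV>,
                     INV extends Invertible<INV, SELF>> {
    INV inverse();
    void applyTo(Set<String> store);
}

final class AddQuad implements Invertible<AddQuad, DeleteQuad> {
    final String quad;
    AddQuad(String quad) { this.quad = quad; }
    public DeleteQuad inverse() { return new DeleteQuad(quad); }
    public void applyTo(Set<String> store) { store.add(quad); }
}

final class DeleteQuad implements Invertible<DeleteQuad, AddQuad> {
    final String quad;
    DeleteQuad(String quad) { this.quad = quad; }
    public AddQuad inverse() { return new AddQuad(quad); }
    public void applyTo(Set<String> store) { store.remove(quad); }
}
```

A declaration such as `class Other implements Invertible<Other, AddQuad>` is rejected by the compiler, because `AddQuad` does not declare `Other` as its inverse. Undoing an operation is then just `op.inverse().applyTo(store)`, with no case analysis in the caller.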

I may be on the wrong track, but I do think keeping the two abstractions separate is a good thing at the start, since it's usually easier to merge abstractions than to separate them. I do perceive the risk of frothing GC, but I'd like to go forward far enough to actually see whether it is a problem before so assuming.

Re: Journaling DatasetGraph

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

> One of the things that strikes me is that extending Quad to be a QuadOperation breaks being a Quad.  It adds functionality a quad does not have.  Two quads are equal if they have the same G/S/P/O and that's not true for QuadOperation.
> An operation is a pair - the action and the data - not data. e.g. Putting a QuadOperation into a DatasetGraph would cause problems.

Andy-- I've thought harder about this and I've realized that whether or not I can make a navel-gazing argument about correctness, the typing is obviously confusing and that's damnation enough. I'll fix this to stop extending Quad.
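A sketch of the composition alternative, using hypothetical records rather than Jena's Quad: the operation holds a quad instead of being one, so quad equality stays purely G/S/P/O and an operation cannot be put where a quad is expected.

```java
// Hypothetical stand-in for a Jena quad; equality is exactly G/S/P/O.
record Quad(String g, String s, String p, String o) {}

enum Action { ADD, DELETE }

// Composition: an operation has a quad, it is not one.
record QuadOperation(Action action, Quad quad) {}

class OperationDemo {
    // Two operations can carry the same data while being different operations.
    static boolean sameData(QuadOperation a, QuadOperation b) {
        return a.quad().equals(b.quad());
    }
}
```

A `Set<Quad>` will not accept a `QuadOperation` at compile time, which rules out the kind of mix-up raised in the quoted objection.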

---
A. Soroka
The University of Virginia Library

On Jul 25, 2015, at 7:43 AM, Andy Seaborne <an...@apache.org> wrote:

> On 23/07/15 14:18, ajs6f@virginia.edu wrote:
>> After a longish conversation with Andy Seaborne, I've worked up a simple journaling DatasetGraph wrapping implementation. The idea is to use journaling to support proper aborting behavior (which I believe this code does) and to add to that a semantic for DatasetGraph::addGraph that copies tuples instead of leaving a reference to the added Graph (which I believe this code also does). Between these two behaviors, the idea is to be able to support transactionality (MRSW only) reasonably well.
>> 
>> The idea is (if this code looks like a reasonable direction) to move onwards to an implementation that uses persistent data structures for covering indexes in order to get at least to MR+SW and eventually to attack JENA-624: "Develop a new in-memory RDF Dataset implementation".
>> 
>> Feedback / advice / criticism greedily desired and welcome!
>> 
>> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>> 
>> https://github.com/apache/jena/compare/master...ajs6f:JournalingDatasetgraph
>> 
>> ---
>> A. Soroka
>> The University of Virginia Library
>> 
> 
> Hi there,
> 
> A first look - there's quite a lot to do with the release at the moment.
> 
> Having a separate set of functionality to the underlying DatasetGraph is good for the MRSW case and with that composition on multiple datasets, text indexes etc etc.
> 
> For the MR+SW, I think the more connected nature of transactions and implementation might make it harder to have independent functionality but we'll see.
> 
> https://github.com/afs/mantis/tree/master/dboe-transaction
> is a take on a transaction mechanism.  I'm using it at the moment so I'm finding out what works ... and what does not.
> 
> 
> Yes - addGraph ought to be a copy.  The general dataset where the app can put together a collection of different graph types is the exception but needed for the case of some graphs being inference, maybe some not.
> 
> 
> One of the things that strikes me is that extending Quad to be a QuadOperation breaks being a Quad.  It adds functionality a quad does not have.  Two quads are equal if they have the same G/S/P/O and that's not true for QuadOperation.
> 
> An operation is a pair - the action and the data - not data.
> 
> e.g. Putting a QuadOperation into a DatasetGraph would cause problems.
> 
> 
> ListBackedOperationRecord<OpType> extends ReversibleOperationRecord<OpType>
> 
> [[
> public class ListBackedOperationRecord<OpType extends InvertibleOperation<?, ?, ?, ?>>
> 		implements ReversibleOperationRecord<OpType> {
> ]]
> 
> 
> while, yes, a collection of operations could be an operation, datasets don't provide such composite operations so the abstraction is not used.  And the reverse of it would be recursive - each operation needs reversing.
> 
> I'd keep log (= list of operations) as a separate concept from the operations themselves.  One key operation of a ListBackedOperationRecord is clear and Operations are
> 
> Or this is a naming thing, is "record" the log entry or the log itself?
> 
> 
> Is there some specific reason as to why you override the DatasetGraphWithLock lock?
> 
> 
> My take on this is:
> 
> https://github.com/afs/jena-workspace/tree/master/src/main/java/transdsg
> 
> One difference is that the notion of reversing an operation is not a feature of the operation itself; it's the way it is played back.  Partially, this is efficiency (which may not matter) as it reduces the object churn, but it also puts undo-playback in one place (e.g. reading and writing from storage, which might be non-heap memory, a compacted form, or even a disk) for cases where large+long transactions, even on in-memory datasets, lead to excessive object use.  Just an idea.
> 
> 	Andy
>

Re: Journaling DatasetGraph

Posted by Andy Seaborne <an...@apache.org>.

On 23/07/15 14:18, ajs6f@virginia.edu wrote:
> After a longish conversation with Andy Seaborne, I've worked up a simple journaling DatasetGraph wrapping implementation. The idea is to use journaling to support proper aborting behavior (which I believe this code does) and to add to that a semantic for DatasetGraph::addGraph that copies tuples instead of leaving a reference to the added Graph (which I believe this code also does). Between these two behaviors, the idea is to be able to support transactionality (MRSW only) reasonably well.
>
> The idea is (if this code looks like a reasonable direction) to move onwards to an implementation that uses persistent data structures for covering indexes in order to get at least to MR+SW and eventually to attack JENA-624: "Develop a new in-memory RDF Dataset implementation".
>
> Feedback / advice / criticism greedily desired and welcome!
>
> https://github.com/ajs6f/jena/tree/JournalingDatasetgraph
>
> https://github.com/apache/jena/compare/master...ajs6f:JournalingDatasetgraph
>
> ---
> A. Soroka
> The University of Virginia Library
>

Hi there,

A first look - there's quite a lot to do with the release at the moment.

Keeping this functionality separate from the underlying DatasetGraph is
good for the MRSW case, and with it composition over multiple datasets,
text indexes, etc.

For the MR+SW, I think the more connected nature of transactions and 
implementation might make it harder to have independent functionality 
but we'll see.

https://github.com/afs/mantis/tree/master/dboe-transaction
is a take on a transaction mechanism.  I'm using it at the moment so I'm
finding out what works ... and what does not.

Yes - addGraph ought to be a copy.  The general dataset where the app 
can put together a collection of different graph types is the exception 
but needed for the case of some graphs being inference, maybe some not.
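
A minimal sketch of the copy semantic for addGraph, not the Jena API: CopyingDataset and its methods are hypothetical stand-ins, with triples modelled as plain strings so the example is self-contained.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-in for a dataset: graph name -> set of triples (as strings).
final class CopyingDataset {
    private final Map<String, Set<String>> graphs = new HashMap<>();

    // Copy semantics: the dataset stores its own copy of the triples,
    // so later changes to the caller's graph do not leak in.
    void addGraph(String name, Set<String> graph) {
        graphs.put(name, new HashSet<>(graph));
    }

    boolean contains(String name, String triple) {
        Set<String> g = graphs.get(name);
        return g != null && g.contains(triple);
    }
}

class CopyingDatasetDemo {
    public static void main(String[] args) {
        Set<String> graph = new HashSet<>();
        graph.add("s p o1");

        CopyingDataset ds = new CopyingDataset();
        ds.addGraph("g", graph);

        graph.add("s p o2"); // mutate the caller's graph after adding it

        System.out.println(ds.contains("g", "s p o1")); // true
        System.out.println(ds.contains("g", "s p o2")); // false
    }
}
```

With a retained reference instead of a copy, changes made through the original graph would presumably reach the dataset without going through the wrapper, which is the case the copy avoids.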

One of the things that strikes me is that extending Quad to be a 
QuadOperation breaks being a Quad.  It adds functionality a quad does 
not have.  Two quads are equal if they have the same G/S/P/O and that's 
not true for QuadOperation.

An operation is a pair - the action and the data - not data.

e.g. Putting a QuadOperation into a DatasetGraph would cause problems.
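
A minimal sketch of "an operation is a pair", using composition rather than inheritance. SimpleQuad, Action and QuadOp are hypothetical names for illustration; SimpleQuad only stands in for Jena's Quad so the example runs on its own.

```java
import java.util.Objects;

// Stand-in for a quad: equality depends only on G/S/P/O.
final class SimpleQuad {
    final String g, s, p, o;

    SimpleQuad(String g, String s, String p, String o) {
        this.g = g; this.s = s; this.p = p; this.o = o;
    }

    @Override public boolean equals(Object other) {
        if (!(other instanceof SimpleQuad)) return false;
        SimpleQuad q = (SimpleQuad) other;
        return g.equals(q.g) && s.equals(q.s) && p.equals(q.p) && o.equals(q.o);
    }

    @Override public int hashCode() { return Objects.hash(g, s, p, o); }
}

enum Action { ADD, DELETE }

// An operation *has* a quad; it is not itself a quad.
final class QuadOp {
    final Action action;
    final SimpleQuad quad;

    QuadOp(Action action, SimpleQuad quad) {
        this.action = action;
        this.quad = quad;
    }

    @Override public boolean equals(Object other) {
        if (!(other instanceof QuadOp)) return false;
        QuadOp op = (QuadOp) other;
        return action == op.action && quad.equals(op.quad);
    }

    @Override public int hashCode() { return Objects.hash(action, quad); }
}

class QuadOpDemo {
    public static void main(String[] args) {
        SimpleQuad q = new SimpleQuad("g", "s", "p", "o");
        QuadOp add = new QuadOp(Action.ADD, q);
        QuadOp del = new QuadOp(Action.DELETE, q);

        System.out.println(add.quad.equals(del.quad)); // true: same data
        System.out.println(add.equals(del));           // false: different action
    }
}
```

Because QuadOp is not a subtype of the quad type, it cannot be put into a dataset by accident, and quad equality stays purely G/S/P/O.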

ListBackedOperationRecord<OpType> extends ReversibleOperationRecord<OpType>

[[
public class ListBackedOperationRecord<OpType extends InvertibleOperation<?, ?, ?, ?>>
		implements ReversibleOperationRecord<OpType> {
]]

While, yes, a collection of operations could be an operation, datasets
don't provide such composite operations so the abstraction is not used.
And the reverse of it would be recursive - each operation needs reversing.

I'd keep the log (= list of operations) as a separate concept from the
operations themselves.  One key operation of a ListBackedOperationRecord 
is clear and Operations are

Or this is a naming thing, is "record" the log entry or the log itself?
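
A minimal sketch of the log as its own concept, separate from the entries it holds. OperationLog and its method names are hypothetical; the log is generic in its entry type and knows nothing about how an entry is applied or reversed.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;

// Hypothetical log type: an ordered list of entries, not itself an operation.
final class OperationLog<T> {
    private final Deque<T> entries = new ArrayDeque<>();

    void append(T entry) { entries.addLast(entry); }

    // Discard all entries, e.g. once a transaction has committed.
    void clear() { entries.clear(); }

    int size() { return entries.size(); }

    // Entries newest first, which is the order an undo would need.
    List<T> newestFirst() {
        List<T> out = new ArrayList<>();
        Iterator<T> it = entries.descendingIterator();
        while (it.hasNext()) {
            out.add(it.next());
        }
        return out;
    }
}

class OperationLogDemo {
    public static void main(String[] args) {
        OperationLog<String> log = new OperationLog<>();
        log.append("ADD q1");
        log.append("DELETE q2");

        System.out.println(log.newestFirst()); // [DELETE q2, ADD q1]

        log.clear();
        System.out.println(log.size()); // 0
    }
}
```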

Is there some specific reason as to why you override the 
DatasetGraphWithLock lock?

My take on this is:

https://github.com/afs/jena-workspace/tree/master/src/main/java/transdsg

One difference is that the notion of reversing an operation is not a
feature of the operation itself; it's the way it is played back.
Partially, this is efficiency (which may not matter) as it reduces the
object churn, but it also puts undo-playback in one place (e.g. reading
and writing from storage, which might be non-heap memory, a compacted
form, or even a disk) for cases where large+long transactions, even on
in-memory datasets, lead to excessive object use.  Just an idea.
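
A minimal sketch of undo as playback, assuming a toy dataset (a set of strings standing in for quads). JournaledSet and its members are hypothetical and are not taken from the transdsg code linked above. The log entries carry only the action and the data; the inversion lives in abort().

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class JournaledSet {
    private enum Action { ADD, DELETE }

    // A log entry is plain data: it has no "invert" method of its own.
    private static final class Entry {
        final Action action;
        final String quad;
        Entry(Action action, String quad) { this.action = action; this.quad = quad; }
    }

    private final Set<String> data = new HashSet<>();
    private final List<Entry> log = new ArrayList<>();

    // Only changes that actually alter the data are logged, so that
    // abort() never undoes something the transaction did not do.
    void add(String quad) {
        if (data.add(quad)) log.add(new Entry(Action.ADD, quad));
    }

    void delete(String quad) {
        if (data.remove(quad)) log.add(new Entry(Action.DELETE, quad));
    }

    void commit() { log.clear(); }

    // Undo-playback in one place: walk the log newest first and apply
    // the opposite of each recorded action.
    void abort() {
        for (int i = log.size() - 1; i >= 0; i--) {
            Entry e = log.get(i);
            if (e.action == Action.ADD) data.remove(e.quad);
            else data.add(e.quad);
        }
        log.clear();
    }

    Set<String> snapshot() { return new HashSet<>(data); }
}

class JournaledSetDemo {
    public static void main(String[] args) {
        JournaledSet ds = new JournaledSet();
        ds.add("q1");
        ds.commit();

        ds.add("q2");
        ds.delete("q1");
        ds.abort();

        System.out.println(ds.snapshot()); // [q1]
    }
}
```

Since the entries are plain data, the same playback loop could in principle read them from some other storage instead of a heap list; the list here is only the simplest choice for a sketch.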

	Andy