Posted to dev@jena.apache.org by "Paolo Castagna (JIRA)" <ji...@apache.org> on 2011/02/06 16:57:30 UTC

[jira] Created: (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Different policy for concurrency access in TDB supporting a single writer and multiple readers
----------------------------------------------------------------------------------------------

                 Key: JENA-41
                 URL: https://issues.apache.org/jira/browse/JENA-41
             Project: Jena
          Issue Type: New Feature
          Components: Fuseki, TDB
            Reporter: Paolo Castagna


As a follow-up to a discussion about "Concurrent updates in TDB" [1] on the jena-users mailing list, I am creating this as a new feature request.

Currently TDB requires developers to use a Multiple Reader or Single Writer (MRSW) locking policy for concurrent access [2]. Not doing so can cause data corruption.
The MRSW policy is in fact MR xor SW (i.e. while a writer holds the lock, no readers are allowed and, similarly, while a reader holds the lock, no writes are possible).
This works fine in most situations, but there can be problems in the presence of long writes or long reads.
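
For reference, the MRSW pattern documented in [2] looks roughly like this (a sketch against the Jena API of the time; the try/finally is essential so the lock is always released):

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.shared.Lock;

    public void readExample(Dataset dataset)
    {
        dataset.getLock().enterCriticalSection(Lock.READ);  // Lock.WRITE for updates
        try
        {
            // ... read from the dataset: queries, iteration, etc. ...
        }
        finally
        {
            dataset.getLock().leaveCriticalSection();  // always release the lock
        }
    }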

It has been suggested that a "journaled file access" could be used to solve the issue regarding a long write blocking reads.

 [1] http://markmail.org/message/jnqm6pn32df4wgte
 [2] http://openjena.org/wiki/TDB/JavaAPI#Concurrency


[jira] [Resolved] (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne resolved JENA-41.
-------------------------------

    Resolution: Fixed

Marking as "fixed" because transactional TDB has been released as a prototype.  It provides MR+SW.




> Different policy for concurrency access in TDB supporting a single writer and multiple readers
> ----------------------------------------------------------------------------------------------
>
>                 Key: JENA-41
>                 URL: https://issues.apache.org/jira/browse/JENA-41
>             Project: Jena
>          Issue Type: New Feature
>          Components: Fuseki, TDB
>            Reporter: Paolo Castagna
>            Assignee: Andy Seaborne
>         Attachments: Transaction.java, TransactionHandle.java, TransactionHandler.java, TransactionManager.java, TransactionManagerBase.java, TransactionalDatasetGraph.java
>

[jira] Commented: (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12991232#comment-12991232 ] 

Andy Seaborne commented on JENA-41:
-----------------------------------

Paolo wrote in jena-users@i.a.o
> Would the "journaled file access" approach you have suggested be a 
> solution for avoiding long reads forbidding/delaying writes?

It's one of the things you can do. It's a form of transactions, yet for read access it retains the current performance model of TDB.

It allows reading and writing at the same time, up until the point the writes need to be committed.  At that point, optionally, write the changes to disk as a journal log (also needed if the amount of change is very large; see a related discussion on the dev list), then take a lock to get exclusive access to the dataset.  Write the changes (with the opportunity to do write ordering for improved performance), then release the exclusive lock.

In this scheme, a slow writer does not lock out readers, and readers do not lock out the application writing changes.  There is still an exclusive window but it is shorter and not dependent on unhelpfully slow write clients.
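
A minimal sketch of that commit sequence (Journal, ChangeSet and Database are hypothetical types, not TDB code; locking uses the standard java.util.concurrent classes):

    import java.io.IOException;
    import java.util.concurrent.locks.ReadWriteLock;

    public void commit(Journal journal, ChangeSet pending, ReadWriteLock dbLock, Database db) throws IOException
    {
        journal.append(pending);         // 1. record the changes as a journal log ...
        journal.sync();                  // 2. ... forced to disk, so a crash can redo the commit
        dbLock.writeLock().lock();       // 3. short exclusive window over the dataset
        try
        {
            pending.applyTo(db);         // 4. write the changes (write ordering possible here)
        }
        finally
        {
            dbLock.writeLock().unlock(); // 5. readers resume against the updated database
        }
    }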

If the journal log is persistent, then recovery from crashes during the commit stage is a natural addition.  A non-persistent journal would only be useful for speed, and given that the journal can be written and read in an efficient style, I would not have thought it hugely useful.

Alternatives include optimistic-concurrency-based schemes, but the possibility of clients needing to repeat actions is likely to be a burden, so there would need to be a much bigger advantage to outweigh the complexity.

	Andy



Re: Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by Andy Seaborne <an...@epimorphics.com>.

On 30/03/11 22:46, Stephen Allen wrote:
> Andy,

> As an aside, I recall you mentioning that you had a BDB version of
> TDB; using that would seem to offer a fast, stable way of adding
> transactions to your B-trees.  Out of curiosity, were there problems
> with using BDB?

https://github.com/afs/TDB-BDB

No problems as such but it just isn't very fast (non-transactionally). 
There is no bulk loading advantage at all, and query performance was 
slower but OK.  That's before turning on transactions.  As the data 
scaled, the difference between TDB native and TDB-BDB became more 
pronounced.

BDB-C and BDB-JE are about the same speed.

Given they were already slower, and that for TxTDB I want to retain 
reader performance, it doesn't look like a good starting point.

It might be a good place for a version with different goals - less 
emphasis on scale, more on high-frequency writes (and fewer reads), for 
example a sensor data hub.

I don't know why they are slower, but I speculate that the general 
purpose design of both BDBs (e.g. fully variable-length keys and values, 
node size, overhead in the tree blocks for all sorts of features not 
used) means they are optimized for something else. BDB is designed for 
high-write concurrency - RDF datastores are for publishing (read 
dominant).  Sometimes these design objectives pull in different directions.

I used BDB to store the string table as well (lexical forms of nodes). 
It was better to use a native string file.

Maybe it's a case of not using them to their best advantage.

tdbloader1 simply does the loading work in an order that is better than 
adding triples one at a time, indexing as you go.  It loads the primary 
index, then builds the secondary indexes by copying from the primary. 
That applies to BDB too, but it didn't help.

tdbloader2 uses Unix sort(1) to prepare the index data by sorting into 
the order for each index, then writes the B+Trees directly to disk (from 
the bottom up and very carefully).

	Andy

RE: Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by Stephen Allen <sa...@bbn.com>.
Andy,

I'm still in the design process, so this could change.  But my plan for now is to use BDB's logging for the resource dictionary and our own logging for the statement and resource tables.  The database "element" for those two tables will be a single tuple (i.e. a row).  These tables are not append-only: statements may have their expiration XID updated, and resources will have their linked list and statistics updated.  As I mentioned, I am focusing on single-writer for now, so that simplifies a lot of the logging and concurrency design.


I presently can imagine three options for implementing durability in Parliament (+Pros and -Cons):

1) Stick with memory mapped I/O by using an undo/redo log (records old value and new value for each element)
     + Memory mapping gives high-performance access to database elements
     + I/O system is implemented this way already
     - Need to deal with the possibility that the OS can write changes to disk before they are flushed to the log.  Two methods:
        a) All writes that would go to pages are instead buffered in memory until the log is flushed periodically then copied to the actual pages
        b) Make use of functions to lock modified pages in memory until log is flushed (Windows has VirtualLock(), Linux has mlock())

2) Abandon memory mapped I/O and use our own buffer manager
     + Full control of flushing to disk
     + Can use any logging scheme we want (redo may not be feasible for huge write transactions though)
     - A fair amount of work to implement

3) Use BDB's Queue structure for the statement and resource tables [1]
     + Fast, stable, easy to use instead of implementing our own log
     - Logical record numbers limited to 32-bit unsigned type (eventually limits us to only 4,294,967,295 triples)


I am leaning towards solution 1) at the moment.  The undo/redo log is attractive because of the drawbacks of undo-only and redo-only logging quoted from [2] (p. 869):

- Undo logging requires that data be written to disk immediately after a transaction finishes, perhaps increasing the number of disk I/O's that need to be performed.

- On the other hand, redo logging requires us to keep all modified blocks in buffers until the transaction commits and the log records have been flushed, perhaps increasing the average number of buffers required by transactions.

- Both undo and redo logs may put contradictory requirements on how buffers are handled during a checkpoint, unless the database elements are complete blocks or sets of blocks.

The last point is important because I want to use tuples as my elements (since they are tiny compared to the block, the log size is dramatically reduced even though I am storing both undo and redo information).  The rule for undo/redo is simply that the update log record appears on disk before any database elements are modified on disk, and one of the implementations of 1) can do that.
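
A sketch of the per-tuple undo/redo record and that write-ahead rule (Log and Page are hypothetical types, not Parliament code):

    import java.io.IOException;

    // One log record per modified element (tuple), carrying both values:
    // oldValue supports undo (rollback), newValue supports redo (recovery).
    final class UpdateLogRecord
    {
        final long txId;
        final long elementId;  // one tuple, not a whole page - keeps the log small
        final byte[] oldValue;
        final byte[] newValue;

        UpdateLogRecord(long txId, long elementId, byte[] oldValue, byte[] newValue)
        {
            this.txId = txId;
            this.elementId = elementId;
            this.oldValue = oldValue;
            this.newValue = newValue;
        }
    }

    // The undo/redo rule: the record reaches disk before the element changes.
    void update(Log log, Page page, UpdateLogRecord rec) throws IOException
    {
        log.append(rec);
        log.flush();                              // record on disk first ...
        page.write(rec.elementId, rec.newValue);  // ... then modify the element
    }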

As an aside, I recall you mentioning that you had a BDB version of TDB; using that would seem to offer a fast, stable way of adding transactions to your B-trees.  Out of curiosity, were there problems with using BDB?

-Stephen

P.S. Parliament currently stores triples-table-per-graph, but internally it can accept quads; the interface to Jena needs to be changed to support quads, however.  Ultimately I think we may have as a runtime option what method is desired, since triples-table-per-graph offers some advantages when you have a small number of large graphs that you can then drop very efficiently.

[1] http://pybsddb.sourceforge.net/ref/am_conf/logrec.html
[2] http://www.amazon.com/Database-Systems-Complete-Book-2nd/dp/0131873253




Re: Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by Andy Seaborne <an...@epimorphics.com>.

On 22/03/11 23:03, Stephen Allen wrote:
> Hi Andy,
>
> This is a rather long email I'm afraid, but I've split it into two
> parts: Parliament and TxTDB comments:
>
>
> Parliament ----------
>
> The unit of MVCC I'm looking at is per quad.  This is similar to
> PostgreSQL's row-level MVCC.  It also means that the WAL is used for
> durability guarantees and transaction rollback rather than
> concurrency control.  The WAL will track statement
> insertions/deletions and disk page states on the first modification
> after a checkpoint.

Stephen,

On the Parliament design to check I understand ...

When a new quad is added to the store during a transaction, the quad is 
written to the WAL.  Is anything else written?

The quad needs to be added to the statement list and any new nodes to 
the resource table.  Some data structure manipulations need to be done 
atomically with respect to failure.  The data structures are in memory 
mapped files - there is a (small) chance that part of the mapped file is 
written back to disk at any time if the OS decides it needs to swap out 
that page.  Having been recently used by being updated, it's quite 
unlikely to be a candidate, but it's possible.

Does Parliament record the lower level statement list and resource table 
changes?  Or is the adding of a quad and adding of a resource 
idempotent - at least repeatable without reading from the 
data structure in some way?

These actions that are repeatable without looking at the original state 
are quite nice from a redo-log perspective.

I can see how it might be possible to have datastructures that are 
insensitive to corruption under append because the repeat is simply to 
put in the right answers without reading the potentially corrupt data.

B+Trees don't have this property - when a tree node splits, three nodes 
need writing (left of split, right of split, parent).  A change to a 
single node is OK because the trees don't support duplicate keys.  But 
if the tree is partially written back in a split then the tree on disk 
is broken and replay does not help - the replay traverses the tree 
assuming it's valid.

The chances of this are small - splits aren't the common change 
operation, and updates close in time are likely to be all written or not 
written, but it's possible to interrupt the writing in a small window.

(Side question: the ISWC 2009 paper only describes triples - is that a 
quads table or a triples-table-per-graph?)

	Andy

Re: Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by Andy Seaborne <an...@epimorphics.com>.

On 22/03/11 23:03, Stephen Allen wrote:
> Hi Andy,
>
> This is a rather long email I'm afraid, but I've split it into two parts: Parliament and TxTDB comments:

Very helpful indeed.  I'll respond to the TxTDB part here -- the 
Parliament description is interesting and helpful as to what you plan to 
do and how you came to your design decisions.

> TxTDB -----
>
> It sounds like you may be describing shadow paging [1].  In shadow
> paging, the basic idea is that any reader or writer gets a view onto
> the data based on reachability from pointers in a particular root
> block.  Pages that are reachable from any live root block are never
> modified.  A vacuum process is required to collect the space from
> blocks that are no longer reachable (the last reader out could do it
> too).  Updates to indexes must be treated in roughly the same way as
> data pages, because they contain pointers to different data.  If you
> wrote each block to disk at commit time the system would belong to
> the class of logless transaction systems (although a small log might
> be needed to prevent partial updates to the root page pointers).
> With a shadow paging system you also would not have to actually copy
> the new pages back over top of the old ones.
>
>
> -Stephen
>
> [1] http://en.wikipedia.org/wiki/Shadow_paging

Sort of shadow paging - the Wikipedia page describes shadow paging 
where a new page is allocated with a new page id, with the need to chase 
back up any data structure that refers to the page and change that (the 
latter could be in-place or shadowed recursively).  It requires data 
structure involvement to chase the references.

I'm thinking of keeping the same page id, where the transactions have a 
context that enables them to go to the right place to find a given 
page.  The data structures referring to the block do not change; the 
"get block 123" step carries out the looking for the right version.

The B+Tree code is already structured around "get block", "put block" 
operations - one reason this new implementation was done rather than 
picking up an existing one.
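
A sketch of what that version-aware "get block" might look like (Transaction, JournalMap and BlockFile are hypothetical names, not the TDB API):

    // Resolve a block id through the transaction's view.  A writer sees its
    // own uncommitted blocks first; then come block versions committed before
    // this transaction started but not yet written back (held in the WAL);
    // finally the base database file.  The block id itself never changes.
    public Block getBlock(Transaction txn, long blockId)
    {
        Block b = txn.ownChanges().get(blockId);
        if (b == null)
            b = wal.versionAt(txn.startVersion(), blockId);
        if (b == null)
            b = baseFile.read(blockId);
        return b;
    }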

http://www.sqlite.org/wal.html
Section "How WAL Works"

which isn't quite the same WAL as Wikipedia's either.  This WAL is a redo 
log without in-place changes as the transaction progresses.

I hope that one form of compression for new blocks is to use a 
block-type-aware compressor, so adding a triple to an index is not a 
copy of the whole block, but just "shift-insert bytes ABC at offset 123, 
moving up old bytes".  Most changes in a B+Tree are on one block, not 
causing a block split or rebalancing.  That level of compression takes 
the block change down to about "insert triple" size, except there are 
multiple indexes.  TDB does not store a distinguished triple as such; 
one index is considered "primary" but they are all clustered indexes at 
the moment.  Primary is just the one the code assumes exists and falls 
back on, although currently only total indexing is done so that never 
happens.
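
A sketch of such a typed block-change record and its replay (hypothetical record, not TDB code; assumes heap-backed buffers so array() is available):

    import java.nio.ByteBuffer;

    // "shift-insert bytes at offset": logs roughly "insert triple"-sized
    // changes instead of whole-block copies.
    final class ShiftInsert
    {
        final long blockId;
        final int offset;
        final byte[] bytes;

        ShiftInsert(long blockId, int offset, byte[] bytes)
        {
            this.blockId = blockId;
            this.offset = offset;
            this.bytes = bytes;
        }

        // Replay: move the old bytes up to make room, then copy the new bytes in.
        void applyTo(ByteBuffer block)
        {
            byte[] a = block.array();
            int tail = block.limit() - offset - bytes.length;
            System.arraycopy(a, offset, a, offset + bytes.length, tail);
            System.arraycopy(bytes, 0, a, offset, bytes.length);
        }
    }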

It would also be possible to reverse the whole thing - copy old data out 
of the way and change in-place.  That puts the I/O to the database on 
the writer critical path, which is random access, whereas redo puts 
append+sync on the log as the I/O on the writer critical path.

	Andy

RE: Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by Stephen Allen <sa...@bbn.com>.
Hi Andy,

This is a rather long email I'm afraid, but I've split it into two parts: Parliament and TxTDB comments:


Parliament
----------

The unit of MVCC I'm looking at is per quad.  This is similar to PostgreSQL's row-level MVCC.  It also means that the WAL is used for durability guarantees and transaction rollback rather than concurrency control.  The WAL will track statement insertions/deletions and disk page states on the first modification after a checkpoint.

Each statement in Parliament will have two transaction IDs: a creation ID that identifies the transaction that created the statement, and an expiration ID that identifies the transaction that expired (deleted) the statement.  As statements are immutable (statements can only be created or deleted), we can expire a statement by atomically updating the expiration ID field and avoid locking.  When a statement is expired, Parliament retains it until all transactions that were in process at the time of expiration have completed (after which time the statement is allowed to be removed by the background vacuum task).

Read iterators can determine if any particular statement is visible by recording two things at the start of the transaction: 1) the current transaction ID, and 2) all in-process transaction IDs.  When a transactional query is issued, Parliament displays all statement versions that match the following criteria:
  - The statement’s creation ID is a committed transaction and is less than the current transaction counter
  - The statement lacks an expiration ID, its expiration ID was in process at query start, or the expiration ID is greater than the current transaction counter
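
A sketch of that visibility test as code (Statement, CommitTable and the NONE marker are hypothetical; a direct transcription of the two criteria above, not Parliament code):

    import java.util.Set;

    boolean isVisible(Statement s, long currentTxCounter, Set<Long> inProcessAtStart, CommitTable commits)
    {
        // Criterion 1: the creation ID is committed and below the current counter.
        if (!commits.isCommitted(s.creationId()) || s.creationId() >= currentTxCounter)
            return false;
        // Criterion 2: never expired, or the expiring transaction was still in
        // process when we started, or the expiration is after our snapshot.
        long exp = s.expirationId();            // Statement.NONE when never expired
        return exp == Statement.NONE
            || inProcessAtStart.contains(exp)
            || exp > currentTxCounter;
    }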

One helpful result of structuring our statement iterator operation in this manner is that we can align our physical data storage with the abstract RDF concept that a resource can be said to only exist if there exists a statement that uses that resource.  If we ensure that the only canonical source for determining the existence of a resource is the Statement List, then we are free to loosen the isolation level of the resource dictionary to that of UNCOMMITTED_READ while retaining an overall SERIALIZABLE isolation for the triple store itself (thus there is no requirement for MVCC on the resource dictionary, and minimal locking requirements).

Because there is no coordination between read and write transactions, it is not easy to know exactly when a particular expired statement can safely be deleted from the statement list.  Borrowing from other MVCC implementations (PostgreSQL in particular), Parliament will use a VACUUM task that will iterate through all statements and remove any links to statements that are considered expired by all open transactions (possibly placing the address in a free-space list for reuse).

Right now I am going to focus on an implementation with only a single writer, but with the idea that multiple writers are possible in the future.  It is that aspect that leads me to suspect that I will want more isolation levels than SERIALIZABLE, as lower levels could reduce potentially expensive locking.  It could also be useful for improving the throughput of lots of short read queries as well.  UNCOMMITTED_READ, for example, could avoid the locking needed to get a list of all in-process transactions and instead atomically get and increment the current transaction counter.

Some of these isolation levels may only make sense if the transaction interface could specify the start and end of a query inside a transaction, since at the DatasetGraph level all you get are .find(Quad) operations.  The example I gave on the JIRA would need some way of indicating query start inside a transaction.


TxTDB
-----

It sounds like you may be describing shadow paging [1].  In shadow paging, the basic idea is that any reader or writer gets a view onto the data based on reachability from pointers in a particular root block.  Pages that are reachable from any live root block are never modified.  A vacuum process is required to collect the space from blocks that are no longer reachable (the last reader out could do it too).  Updates to indexes must be treated in roughly the same way as data pages, because they contain pointers to different data.  If you wrote each block to disk at commit time the system would belong to the class of logless transaction systems (although a small log might be needed to prevent partial updates to the root page pointers).  With a shadow paging system you also would not have to actually copy the new pages back over top of the old ones.


-Stephen

[1] http://en.wikipedia.org/wiki/Shadow_paging




Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by Andy Seaborne <an...@epimorphics.com>.
Switching to email for general discussion:

Stephen,

Could you say something about your design ideas for transactions and 
concurrency in Parliament?  And specifically, what is the unit of MVCC?

I've mainly been thinking about the design for TDB: a design that 
uses blocks as the unit of transaction and MVCC, with each writer seeing 
a view of the block space that accords with its initial time point.

Using a WAL as a redo log, the changes are not in the main database 
until the WAL is checkpointed.

If you have the time sequence:

R1: start reader 1
W1: start writer 1
W1: finish writer 1
# a reader here is like R1
R2: start reader 2
W2: start writer 2
W2: finish writer 2
R3: start reader 3
... finish queries ...

which is single writer but reader 1 spans two writers.

My idea has been to create an "IP layer for data" (IP as in Internet 
Protocol) over blocks, rather than making the data structures above that 
layer (B+Trees, extensible hash tables) versioned - the data structure 
code has to be slightly version-aware to pass the version context up and 
down to the block layer, but the algorithms are essentially not version 
specific.  This depends on making the block layer sufficiently efficient, 
which might become the practical transaction size limitation. 
Compression, or recording operations on a block according to type (an 
extreme form of application-driven compression), would be very good.  It 
retains the data-publishing focus of TDB - that is, reader performance 
is primary and costs go mainly on the writers.

Only experimentation will tell.  It is my hope to have other data 
structures for indexing in the future, so a clean separation of the 
transaction/MVCC issues and the data structure algorithms is going to 
be good.

Given that design, SERIALIZABLE is the natural isolation policy.  There 
is very little locking over blocks in the main database (see below for 
the possibility of none!).

That's not to say that the API design should not include these 
constants, even if they are not used.

It happens naturally: each writer sees its modified blocks (as written 
to the WAL) and the blocks of past committed writers + the base 
database.  Readers just don't get changed blocks; reader 1 is working 
against the main database; reader 2 gets modified blocks from the WAL 
because the WAL can't all be written to the main database until reader 1 
gets out of the way.

READ_COMMITTED makes sense for limiting growth of the WAL as it means 
that the WAL can be flushed to the main database before an earlier 
reader has finished.  It would have to be done in such a way that all 
the changes happen at once or the data structure invariants could be 
invalid; that is, the reader would also have to be outside such an index 
as well and that needs some locking that would not be there otherwise.

REPEATABLE_READ would need block locking for the updates but seems to 
have a quirk. Unfortunately, just knowing which blocks the reader has 
seen may not be strong enough as a change to the blocks of an index may 
result in a change to a block the reader has looked at and one it has 
not in some data structure consistency way, but the block locking will 
miss that.  Not sure that's safe for the B+trees - it's not in general.

READ_UNCOMMITTED is very likely to break the index structures as the 
higher level assumptions in the B+trees will not be valid.

A feature of the TxTDB MVCC design is that locking is not being done on 
the main database.  Changes only happen when writing back committed 
blocks, and then you know the original block to be over-written is never 
seen by a reader.  Write-back occurs only when all readers up to that 
writer version have exited and so any outstanding reader will be getting 
the updated block via the WAL copy. No block lock needed (a bit scary 
that bit :-).

Recovery is a matter of redoing the WAL so it is potentially more 
expensive than an undo log where the undo actions are just running 
transactions, not a period of recent history of writers.  But we don't 
crash, do we? :-)

I think that limiting the number of versions back that will be admitted 
to some control constant (like 10 or 20, ideally adapting to fewer for 
larger transactions) will be appropriate.  If it's growing, then it's a 
sign the system is being asked to do more than it has the processing 
power to do.

Most of the time, the WAL is write-only, and there are in-memory copies 
of blocks (AKA a cache, with version tiers). This is analogous to the 
direct-mode write cache there is currently, but with an additional 
write-back policy as it needs to sync when a transaction commits.

	Andy


[jira] [Commented] (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009433#comment-13009433 ] 

Stephen Allen commented on JENA-41:
-----------------------------------

I think your idea about the DatasetGraph being the interface for transactions makes sense.  Transactional DatasetGraphs could also provide fallback behavior for legacy code by implementing autocommit transactions if the user called methods on a dataset that was not initialized in a transactionBegin() call.


With regard to the isolation levels, I believe some of the lower levels can make sense for particular applications or queries.  For example, say you want to know the size of a few graphs.

BEGIN READ_ONLY;
select count (*) where { graph <http://example/g1> { ?s ?p ?o . } } ;
select count (*) where { graph <http://example/g2> { ?s ?p ?o . } } ;
COMMIT;

Assuming a traditional pessimistic locking scheme, running the transaction at SERIALIZABLE could cause the locks held by the first select query to also be held through the second query, reducing concurrency (using two transactions instead might not be a good idea as there is usually some amount of overhead associated with creating and committing transactions).

If you were OK with the possibility that the two query results are not truly serializable with respect to each other, then you could improve concurrency by using a READ_COMMITTED isolation level instead that would give serializable results for each query (but not the whole transaction).  And if you really just needed a rough estimate of size, using READ_UNCOMMITTED may be able to avoid locking altogether.

An additional motivating factor for MVCC implementations is that they may be implementing snapshot isolation, which probably maps better to READ_COMMITTED than SERIALIZABLE (especially if it could do predicate locking for true serializable behavior but allow cheaper snapshot isolation if that was all that was needed).  The Postgres documentation does a good job of describing this [1].

I would find it useful to have multiple isolation levels available (even if internally I'm mapping them all to SERIALIZABLE at first).  The four ANSI Isolation levels seem appropriate, and remember that implementations are allowed to map unavailable lower levels to higher levels as desired.


[1] http://developer.postgresql.org/pgdocs/postgres/transaction-iso.html
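
That last point, mapping unavailable lower levels to higher ones, is mechanical; a sketch (hypothetical names; the enum is declared weakest to strongest so ordinal order is strength order):

    import java.util.EnumSet;

    enum IsolationLevel { READ_UNCOMMITTED, READ_COMMITTED, REPEATABLE_READ, SERIALIZABLE }

    // Return the weakest supported level at least as strong as the request;
    // a store that only implements SERIALIZABLE promotes everything to it.
    static IsolationLevel effectiveLevel(IsolationLevel requested, EnumSet<IsolationLevel> supported)
    {
        for (IsolationLevel level : IsolationLevel.values())
            if (level.compareTo(requested) >= 0 && supported.contains(level))
                return level;
        throw new IllegalArgumentException("No supported level at or above " + requested);
    }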




[jira] Commented: (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008009#comment-13008009 ] 

Stephen Allen commented on JENA-41:
-----------------------------------

I have been working on adding transactions to Parliament (using a WAL for durability but with MVCC to provide nonblocking readers).  I've been thinking about the interface between ARQ and graph stores and have attached some code and interfaces I have come up with.

I see two main ways to modify the DatasetGraph interface to provide transactions.

1)  Add explicit transaction support to the interface (see TransactionalDatasetGraph.java).  This means breaking interface changes or a second codepath to deal with these new DatasetGraphs.  But it has the benefit of cleaner syntax if a single thread is using more than one transaction.  Some uses for this capability would be: a) eliminating memory/disk buffers in the SPARQL Update code by obtaining sequential transactions for stores that support snapshot isolation; b) simplifying transaction handling if queries were changed not to run on a single thread (either parallel execution or NIO-style asynchronous execution).  See [1] for sample usage with multiple transactions.

2)  Tie transaction information to the current thread (see TransactionHandler.java).  This works well with the current ARQ query execution but becomes more unwieldy when working with multiple transactions on a single thread.  See [2] for example usage.



[1] 
   public void MultipleTransactions1()
    {
        TransactionalDatasetGraph tdsg = ...
        Quad quadToAdd = ...
        
        // Omitted error handling for clarity

        // A read-only transaction
        Transaction t1 = tdsg.getTransactionManager().begin(IsolationLevel.READ_UNCOMMITTED, AccessMode.READ_ONLY);
        
        // A read-write transaction
        Transaction t2 = tdsg.getTransactionManager().begin(IsolationLevel.SERIALIZABLE, AccessMode.READ_WRITE);
        
        System.out.println("t1.size()= " + tdsg.size(t1));
        
        System.out.println("Adding statement in t2");
        tdsg.add(quadToAdd, t2);
        System.out.println("t2.size()= " + tdsg.size(t1));
        t2.commit();
        
        System.out.println("t1.size()= " + tdsg.size(t1));
        t1.commit();
        
        // Output
        // ------
        // t1.size()= 0
        // Adding statement in t2
        // t2.size()= 1
        // t1.size()= 0
    }

[2] 
   public void MultipleTransactions2()
    {
        // Error handling omitted for clarity
        
        DatasetGraph dsg = ...
        Quad quadToAdd = ...
        
        TransactionHandler th = dsg.getTransactionHandler();
        
        // A read-only transaction
        th.begin(IsolationLevel.READ_UNCOMMITTED, AccessMode.READ_ONLY);
        TransactionHandle t1 = th.suspend();
        
        // A read-write transaction
        th.begin(IsolationLevel.SERIALIZABLE, AccessMode.READ_WRITE);
        TransactionHandle t2 = th.suspend();
        
        th.resume(t1);
        System.out.println("t1.size()= " + dsg.size());
        
        t1 = th.suspend();
        th.resume(t2);
        System.out.println("Adding statement in t2");
        dsg.add(quadToAdd);
        System.out.println("t2.size()= " + dsg.size());
        th.commit();
        
        th.resume(t1);
        System.out.println("t1.size()= " + dsg.size());
        th.commit();
        
        // Output
        // ------
        // t1.size()= 0
        // Adding statement in t2
        // t2.size()= 1
        // t1.size()= 0
    }


 


[jira] Commented: (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13007718#comment-13007718 ] 

Andy Seaborne commented on JENA-41:
-----------------------------------

See design note: https://cwiki.apache.org/confluence/display/JENA/TxTDB


[jira] Issue Comment Edited: (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13008009#comment-13008009 ] 

Stephen Allen edited comment on JENA-41 at 3/17/11 5:20 PM:
------------------------------------------------------------

I have been working on adding transactions to Parliament (using a WAL for durability but with MVCC to provide nonblocking readers).  I've been thinking about the interface between ARQ and graph stores and have attached some code and interfaces I have come up with.

I see two main ways to modify the DatasetGraph interface to provide transactions.

1)  Add explicit transaction support to the interface (see TransactionalDatasetGraph.java).  This means breaking interface changes or a second codepath to deal with these new DatasetGraphs.  But it has the benefit of cleaner syntax if a single thread is using more than one transaction.  Some uses for this capability would be: a) eliminating memory/disk buffers in the SPARQL Update code by obtaining sequential transactions for stores that support snapshot isolation; b) simplifying transaction handling if queries were changed not to run on a single thread (either parallel execution or NIO-style asynchronous execution).  See [1] for sample usage with multiple transactions.

2)  Tie transaction information to the current thread (see TransactionHandler.java).  This works well with the current ARQ query execution but becomes more unwieldy when working with multiple transactions on a single thread.  See [2] for example usage.



[1] 
   public void MultipleTransactions1()
    {
        TransactionalDatasetGraph tdsg = ...
        Quad quadToAdd = ...
        
        // Omitted error handling for clarity

        // A read-only transaction
        Transaction t1 = tdsg.getTransactionManager().begin(IsolationLevel.SERIALIZABLE, AccessMode.READ_ONLY);
        
        // A read-write transaction
        Transaction t2 = tdsg.getTransactionManager().begin(IsolationLevel.SERIALIZABLE, AccessMode.READ_WRITE);
        
        System.out.println("t1.size()= " + tdsg.size(t1));
        
        System.out.println("Adding statement in t2");
        tdsg.add(quadToAdd, t2);
        System.out.println("t2.size()= " + tdsg.size(t1));
        t2.commit();
        
        System.out.println("t1.size()= " + tdsg.size(t1));
        t1.commit();
        
        // Output
        // ------
        // t1.size()= 0
        // Adding statement in t2
        // t2.size()= 1
        // t1.size()= 0
    }

[2] 
   public void MultipleTransactions2()
    {
        // Error handling omitted for clarity
        
        DatasetGraph dsg = ...
        Quad quadToAdd = ...
        
        TransactionHandler th = dsg.getTransactionHandler();
        
        // A read-only transaction
        th.begin(IsolationLevel.SERIALIZABLE, AccessMode.READ_ONLY);
        TransactionHandle t1 = th.suspend();
        
        // A read-write transaction
        th.begin(IsolationLevel.SERIALIZABLE, AccessMode.READ_WRITE);
        TransactionHandle t2 = th.suspend();
        
        th.resume(t1);
        System.out.println("t1.size()= " + dsg.size());
        
        t1 = th.suspend();
        th.resume(t2);
        System.out.println("Adding statement in t2");
        dsg.add(quadToAdd);
        System.out.println("t2.size()= " + dsg.size());
        th.commit();
        
        th.resume(t1);
        System.out.println("t1.size()= " + dsg.size());
        th.commit();
        
        // Output
        // ------
        // t1.size()= 0
        // Adding statement in t2
        // t2.size()= 1
        // t1.size()= 0
    }


 


[jira] Updated: (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephen Allen updated JENA-41:
------------------------------

    Attachment: TransactionManagerBase.java
                TransactionManager.java
                TransactionHandler.java
                TransactionHandle.java
                TransactionalDatasetGraph.java
                Transaction.java

Classes and interfaces to add Transaction support to DatasetGraphs.


[jira] Commented: (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Paolo Castagna (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12992030#comment-12992030 ] 

Paolo Castagna commented on JENA-41:
------------------------------------

What format would the journal log have?  Is it a B+Tree index like the normal TDB indexes?  If so, I am not clear on how the two would be merged together and how to ensure there are no id clashes.
Once I have understood the format of the journal log, do you have a suggestion on where it would be best to "intercept" an update in the TDB code so that it is written to the journal log rather than taking the usual path?
I am not asking for a big explanation, just a couple of paragraphs to start with and to make sure I am going in the right direction.


[jira] [Commented] (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009191#comment-13009191 ] 

Andy Seaborne commented on JENA-41:
-----------------------------------

I've been wondering if the DatasetGraph interface itself is the interface to transactions.  It does behave like a connection to the database anyway.

A call to:
  TransactionManager.transactionBegin(DatasetGraph, READ or WRITE)
returns another DatasetGraph that can be used, and a call to:
  TransactionManager.transactionCommit(DatasetGraph)
commits the work done through it.

This doesn't couple transactions directly to threads, although I'm still assuming that each transaction is itself a single-threaded sequence of operations (no parallel writes inside a transaction).  The returned DatasetGraph can carry all the machinery for transaction control and can be passed to library code that is transaction-unaware.
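
Roughly, usage might look like this (a sketch only: ReadWrite, transactionAbort, and runTransactionUnawareLibraryCode are hypothetical names used for illustration, not part of the attached interfaces):

    DatasetGraph base = ...
    Quad quadToAdd = ...
    
    DatasetGraph txn = TransactionManager.transactionBegin(base, ReadWrite.WRITE);
    try {
        txn.add(quadToAdd);                         // ordinary DatasetGraph calls
        runTransactionUnawareLibraryCode(txn);      // existing library code needs no changes
        TransactionManager.transactionCommit(txn);
    } catch (RuntimeException e) {
        TransactionManager.transactionAbort(txn);   // hypothetical abort call
        throw e;
    }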

For isolation levels, I think the only one that makes sense in the RDF world is SERIALIZABLE: although the weaker levels guard the database, a single triple isn't the application's view of the world in the way that a row in a table might be a single application concept.



[jira] [Issue Comment Edited] (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Stephen Allen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13009433#comment-13009433 ] 

Stephen Allen edited comment on JENA-41 at 3/21/11 10:14 PM:
-------------------------------------------------------------

I think your idea about the DatasetGraph being the interface for transactions makes sense.  Transactional DatasetGraphs could also provide fallback behavior for legacy code by implementing autocommit transactions when the user calls methods on a dataset that was not obtained via a transactionBegin() call.
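
As a minimal sketch of that fallback (inTransaction(), beginWrite(), addInCurrentTransaction(), commit() and abort() are illustrative names, not part of the attached interfaces), an add() call made outside any explicit transaction would wrap itself in a single-operation transaction:

    public void add(Quad quad)
    {
        if (inTransaction()) {
            addInCurrentTransaction(quad);   // normal, explicitly-managed path
            return;
        }
        beginWrite();                        // implicit autocommit transaction
        try {
            addInCurrentTransaction(quad);
            commit();
        } catch (RuntimeException e) {
            abort();
            throw e;
        }
    }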


With regard to the isolation levels, I believe some of the lower levels can make sense for particular applications or queries.  For example, say you want to know the sizes of a few graphs.

BEGIN READ_ONLY;
SELECT (COUNT(*) AS ?c) WHERE { GRAPH <http://example/g1> { ?s ?p ?o } } ;
SELECT (COUNT(*) AS ?c) WHERE { GRAPH <http://example/g2> { ?s ?p ?o } } ;
COMMIT;

Assuming a traditional pessimistic locking scheme, running the transaction at SERIALIZABLE could cause the locks held by the first select query to also be held through the second query, reducing concurrency (using two transactions instead might not be a good idea as there is usually some amount of overhead associated with creating and committing transactions).

If you were OK with the possibility that the two query results are not truly serializable with respect to each other, then you could improve concurrency by using a READ_COMMITTED isolation level instead, which would give serializable results for each query (but not for the whole transaction).  And if you really just needed a rough estimate of the sizes, using READ_UNCOMMITTED might avoid locking altogether.

An additional motivating factor for MVCC implementations is that they may be implementing snapshot isolation, which probably maps better to REPEATABLE_READ than SERIALIZABLE (especially if the store could do predicate locking for true serializable behavior while allowing cheaper snapshot isolation when that is all that is needed).  The Postgres documentation does a good job of describing this [1].

I would find it useful to have multiple isolation levels available (even if internally I'm mapping them all to SERIALIZABLE at first).  The four ANSI isolation levels seem appropriate; remember that implementations are allowed to map unavailable lower levels to higher levels as desired.
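
As a minimal sketch of that mapping (illustrative code: suppose a store natively implements only READ_COMMITTED and SERIALIZABLE), the unavailable levels simply upgrade:

    IsolationLevel effectiveLevel(IsolationLevel requested)
    {
        switch (requested) {
            case READ_UNCOMMITTED: return IsolationLevel.READ_COMMITTED;  // upgraded
            case REPEATABLE_READ:  return IsolationLevel.SERIALIZABLE;    // upgraded
            default:               return requested;                      // supported natively
        }
    }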


[1] http://developer.postgresql.org/pgdocs/postgres/transaction-iso.html




[jira] [Assigned] (JENA-41) Different policy for concurrency access in TDB supporting a single writer and multiple readers

Posted by "Andy Seaborne (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/JENA-41?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andy Seaborne reassigned JENA-41:
---------------------------------

    Assignee: Andy Seaborne
