Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2017/11/03 11:47:01 UTC

TDB details - write transactions.

This is a long message that explains the changes in TDB2 around the way 
write transactions work.

TDB2 transactions are completely different from TDB1 transactions. The
transaction coordinator is general purpose and works on a set of
transaction components; each index is a separate component. In TDB1,
the transaction manager works on the TDB1 database as a whole.
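
The user-facing transaction API is the same either way. A minimal
sketch against a TDB2 dataset (the "DB2" directory is a placeholder
location):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.query.ReadWrite;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.tdb2.TDB2Factory;
    import org.apache.jena.vocabulary.RDFS;

    public class Tdb2TxnExample {
        public static void main(String[] args) {
            // Placeholder location; any directory path works.
            Dataset dataset = TDB2Factory.connectDataset("DB2");
            dataset.begin(ReadWrite.WRITE);
            try {
                Model m = dataset.getDefaultModel();
                m.createResource("http://example/s")
                 .addProperty(RDFS.label, "example");
                dataset.commit();
            } finally {
                dataset.end();
            }
        }
    }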

** TDB1

In TDB1, a write transaction creates a number of changes to be made to 
the database. These are stored in the journal.  They consist of 
replacement blocks (i.e. overwrites) and new blocks for the indexes.  All 
later transactions (after the write transaction commits) use the 
in-memory cache of the journal together with the main database.
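
Conceptually, a reader after that commit resolves a block by first
checking the in-memory view of the journal and falling back to the main
database file. A hypothetical sketch of that lookup (the class and
method names are illustrative, not TDB1's actual internals):

    import java.nio.ByteBuffer;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Illustrative only: a block read that overlays journalled blocks
    // on top of the base database file.
    class OverlayBlockSource {
        private final Map<Long, ByteBuffer> journalledBlocks = new ConcurrentHashMap<>();
        private final BaseFile baseFile;   // hypothetical accessor for the main database file

        OverlayBlockSource(BaseFile baseFile) { this.baseFile = baseFile; }

        ByteBuffer read(long blockId) {
            ByteBuffer fromJournal = journalledBlocks.get(blockId);
            return (fromJournal != null) ? fromJournal : baseFile.read(blockId);
        }

        void recordCommittedBlock(long blockId, ByteBuffer block) {
            // Visible to transactions that start after the commit.
            journalledBlocks.put(blockId, block);
        }

        interface BaseFile { ByteBuffer read(long blockId); }
    }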

The node changes are written ahead to the node storage, which is 
append-only, so they don't need recording in the journal. They are 
inaccessible to earlier transactions because they are unreferenced by 
the node table indexes.

The journal needs to be written to the main database. TDB1 is 
update-in-place. TDB1 is also lock-free. Writing to the main indexes 
requires that there are no other transactions using the database.  If 
there are other active transactions, the work is not done immediately 
but queued.

This queue is checked whenever a transaction, read or write, finishes. 
If, at that point, it is the only active transaction, TDB1 writes the 
journal to the main database and clears the journal.  That transaction 
can be a reader: the work of write-back is incurred by the reader.

This is the delayed replay queue. ("Replay" because it is a write-ahead 
logging system, and writing back the journal is replaying the changes.) 
Write transaction changes are always delayed, to amortize the cost of 
write-back across several transactions.

There can be layers: a new writer runs while earlier changes to the 
database are still in the delayed replay queue, and those earlier 
changes may still be in use by readers. A new layer is added for each 
new writer.

Under load, the delayed replay queue grows: there is never a quiet 
moment in which to write the changes back to the main database.

There are a couple of mechanisms to catch this: if the queue is over a 
certain length, or the total size of the journal is over a threshold, 
TDB1 holds back new transactions as they begin, waits for the current 
ones to finish, then writes back the queue.
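
A hypothetical sketch of that policy, in Java-style pseudocode (the
names and thresholds are illustrative, not TDB1's actual code):

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Illustrative only: flush the journal when no other transaction is
    // active, and apply backpressure when the queue or journal gets too big.
    class DelayedReplayPolicy {
        private static final int MAX_QUEUED_COMMITS = 10;        // assumed threshold
        private static final long MAX_JOURNAL_BYTES = 8L << 20;  // assumed threshold

        private final Deque<Object> pendingCommits = new ArrayDeque<>();
        private int activeTransactions = 0;
        private long journalBytes = 0;

        synchronized void begin() throws InterruptedException {
            // Hold back new transactions while too much work is queued.
            while (pendingCommits.size() > MAX_QUEUED_COMMITS || journalBytes > MAX_JOURNAL_BYTES)
                wait();
            activeTransactions++;
        }

        synchronized void queueCommit(Object journalEntry, long bytes) {
            pendingCommits.add(journalEntry);  // write-back is delayed, not done at commit time
            journalBytes += bytes;
        }

        synchronized void finish() {           // called for readers and writers alike
            activeTransactions--;
            if (activeTransactions == 0 && !pendingCommits.isEmpty()) {
                writeBackJournal();            // this transaction, possibly a reader, pays the cost
                pendingCommits.clear();
                journalBytes = 0;
                notifyAll();                   // release any held-back transactions
            }
        }

        private void writeBackJournal() {
            // Replay the queued journal entries into the main database (omitted).
        }
    }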

** TDB2

In TDB2, data structures are "append-only" in that, once written and 
committed, they are never changed.  New data is written to new blocks, 
and either the root of the tree changes (in the case of the B+Trees: 
copy-on-write, also called "persistent data structures", where 
'persistent' is not related to external storage; it is a different 
branch of computer science using the same word with a different 
meaning) or the visible length of the file changes (append-only .dat 
files).
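
The copy-on-write idea, illustrated with a minimal persistent binary
tree rather than the actual B+Tree code: an insert copies only the
nodes on the path from the root to the change, shares everything else,
and yields a new root while the old root stays valid for readers.

    // Minimal persistent (copy-on-write) binary search tree.
    final class PersistentTree {
        final int key;
        final PersistentTree left, right;

        PersistentTree(int key, PersistentTree left, PersistentTree right) {
            this.key = key; this.left = left; this.right = right;
        }

        static PersistentTree insert(PersistentTree root, int key) {
            if (root == null) return new PersistentTree(key, null, null);
            if (key < root.key)
                return new PersistentTree(root.key, insert(root.left, key), root.right);
            if (key > root.key)
                return new PersistentTree(root.key, root.left, insert(root.right, key));
            return root;  // key already present; structure unchanged
        }
    }

    // Usage: 'before' and 'after' are both usable; unchanged subtrees are shared.
    //   PersistentTree before = PersistentTree.insert(null, 5);
    //   PersistentTree after  = PersistentTree.insert(before, 3);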

The only use of the journal is to transactionally manage small control 
data, such as the block id of the new tree root.  A transaction's 
journal entry is less than a disk block.
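
In outline (a hypothetical sketch, not TDB2's actual journal format):
commit writes a small control record and syncs it; once that record is
durable, the new root is the database state.

    import java.io.IOException;
    import java.nio.ByteBuffer;
    import java.nio.channels.FileChannel;

    // Illustrative commit step: the only journalled data is a tiny control
    // record (e.g. the block id of the new tree root), far smaller than a
    // disk block.
    class CommitSketch {
        static void commit(FileChannel journal, long newRootBlockId) throws IOException {
            ByteBuffer record = ByteBuffer.allocate(Long.BYTES);
            record.putLong(newRootBlockId).flip();
            journal.write(record);
            journal.force(true);   // durable: from now on, readers see the new root
        }
    }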

Compared to TDB1, TDB2:

+ Writers write changes to the database as the writer proceeds.

Write efficiency: changes go directly to the database, so there is only 
one write, not two (once to the journal, once to the database), and 
they get write-buffered by the operating system, with all the usual 
efficiency the OS can provide in disk scheduling.

This improves bulk loading to the point where tdb2.tdbloader isn't 
doing low-level file manipulation; it is a simple write to the 
database.  If low-level manipulation turns out to be an improvement, it 
can fit in there.

No variable-size heap cache: large inserts and deletes go to the live 
database and can be any size.  There is no caching of the old-style 
journal, whose size depends on the size of the changes. No more running 
out of heap with a large transaction.
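
For example, a large update can be streamed in a single write
transaction. The sketch below assumes a TDB2 dataset at a placeholder
location and an arbitrary number of triples:

    import org.apache.jena.query.Dataset;
    import org.apache.jena.rdf.model.Model;
    import org.apache.jena.system.Txn;
    import org.apache.jena.tdb2.TDB2Factory;
    import org.apache.jena.vocabulary.RDFS;

    public class LargeWriteExample {
        public static void main(String[] args) {
            Dataset dataset = TDB2Factory.connectDataset("DB2");   // placeholder location
            // One write transaction, however large: changes go straight to the
            // database files, not to a heap-sized journal cache.
            Txn.executeWrite(dataset, () -> {
                Model m = dataset.getDefaultModel();
                for (int i = 0; i < 10_000_000; i++) {
                    m.createResource("http://example/item/" + i)
                     .addProperty(RDFS.label, "item " + i);
                }
            });
        }
    }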

+ Readers only read

A read transaction does not need to do anything about the delayed replay 
queue.  Readers just read the database, never write.

Predictable read performance.
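
A read transaction is correspondingly simple; a minimal sketch
(placeholder location again):

    import org.apache.jena.query.Dataset;
    import org.apache.jena.system.Txn;
    import org.apache.jena.tdb2.TDB2Factory;

    public class ReadExample {
        public static void main(String[] args) {
            Dataset dataset = TDB2Factory.connectDataset("DB2");
            // Readers only read: no write-back work is ever done on the read path.
            long size = Txn.calculateRead(dataset, () -> dataset.getDefaultModel().size());
            System.out.println("Triples in default graph: " + size);
        }
    }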

Of course, there is a downside.

The database grows faster and needs compaction.

People will start asking why the database is so large.  They already 
ask about TDB1, and TDB2 databases will be bigger.

Maintaining compact databases while the system runs has costs, 
depending on how it is done: e.g. it is slower, with some kind of 
incremental maintenance overhead (disk/SSD I/O); transaction 
performance is less predictable; locking schemes become (very) 
complicated, including system aborts when the DB detects a deadlock 
(and bugs, because it is complicated); and large writes impact 
concurrent readers much more.

TDB1 and TDB2 don't system-abort due to deadlock.
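
Compaction in TDB2 is instead an explicit operation, run when needed.
A sketch, assuming the DatabaseMgr.compact call available in current
Jena releases (the tdb2.tdbcompact command-line tool, documented on the
TDB2 admin page, covers the same ground from the shell):

    import org.apache.jena.sparql.core.DatasetGraph;
    import org.apache.jena.tdb2.DatabaseMgr;

    public class CompactExample {
        public static void main(String[] args) {
            // Placeholder location; must be a TDB2 database directory.
            DatasetGraph dsg = DatabaseMgr.connectDatasetGraph("DB2");
            // Compact into a fresh generation of the database.
            DatabaseMgr.compact(dsg);
        }
    }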

Other: the TDB2 transaction coordinator is general, not TDB2-specific, 
so it will be able to include text indexes in the future.

** TDB3

An experiment, not part of Jena. Currently, it's working and not bad. 
Bulk loads are slower at the 100 million triple scale, but the promise 
is that large loads (billion-triple range) are better. As an 
experiment, it may not be a good idea, and it will make slow progress. 
There are no releases and none planned.

TDB3 uses RocksDB -- http://rocksdb.org/.

That means using SSTables, not CoW B+Trees. At the moment there is a 
single SSTable store for everything: the stored data can be partitioned 
within it, so there is no need to have several RocksDB databases.

Still needs compaction. That's an innate feature of SSTable and LSM 
(Log-Structured Merge) systems.

It is also based on work (RocksDB PR#1298) by Adam Retter to expose the 
RocksDB transaction system to Java.
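
As a rough illustration of what that looks like from Java (a sketch
using the org.rocksdb.TransactionDB API as present in current RocksJava
releases, not TDB3 code; the path and key prefix are placeholders):

    import org.rocksdb.*;

    public class RocksTxnExample {
        public static void main(String[] args) throws RocksDBException {
            RocksDB.loadLibrary();
            try (Options options = new Options().setCreateIfMissing(true);
                 TransactionDBOptions txnDbOptions = new TransactionDBOptions();
                 TransactionDB db = TransactionDB.open(options, txnDbOptions, "rocksdb-data");
                 WriteOptions writeOptions = new WriteOptions()) {
                try (Transaction txn = db.beginTransaction(writeOptions)) {
                    // All indexes could share one keyspace by prefixing keys, e.g. "SPO:".
                    txn.put("SPO:example-key".getBytes(), "example-value".getBytes());
                    txn.commit();
                }
            }
        }
    }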

https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats

     Andy

Re: TDB details - write transactions.

Posted by Jean-Marc Vanel <je...@gmail.com>.
The answer is in the docs linked from the Jena 3.5 announcement:
http://jena.apache.org/documentation/tdb2/tdb2_admin.html#compaction

2017-11-03 13:08 GMT+01:00 zPlus <zp...@peers.community>:

> Does Jena have a way to compact TDB2 databases, maybe with some CLI
> tool to run manually? Or do TDB2 databases just grow indefinitely?


-- 
Jean-Marc Vanel
http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me#subject
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui

Re: TDB details - write transactions.

Posted by zPlus <zp...@peers.community>.
Does Jena have a way to compact TDB2 databases, maybe with some CLI
tool to run manually? Or do TDB2 databases just grow indefinitely?
