Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2017/11/03 11:47:01 UTC
TDB details - write transactions.
This is a long message that explains the changes in TDB2 around the way
write transactions work.
TDB2 transactions are completely different from TDB1 transactions. The
transaction coordinator is general purpose and works on a set of
transaction components; each index is a separate component. In TDB1,
the transaction manager works on the TDB1 database as a whole.
** TDB1
In TDB1, a write transaction creates a number of changes to be made to
the database. These are stored in the journal. They consist of
replacement blocks (i.e. overwrites) and new blocks for the indexes. All
later transactions (after the writer commits) use the in-memory cache of
the journal together with the main database.
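As a rough illustration (hypothetical names, not TDB1's actual classes), the read path for transactions that start after a commit can be sketched as a journal cache that is consulted before the main database:

```python
# Sketch: committed-but-not-written-back blocks live in a journal cache;
# later transactions read through that cache, falling back to the main file.

class JournalledBlockStore:
    def __init__(self):
        self.main = {}           # block id -> bytes: the main database file
        self.journal_cache = {}  # committed blocks awaiting write-back

    def write_committed_block(self, block_id, data):
        # After a writer commits, its blocks sit in the journal (and its
        # in-memory cache) rather than being applied to the main database.
        self.journal_cache[block_id] = data

    def read_block(self, block_id):
        # Later transactions see the journalled version first.
        if block_id in self.journal_cache:
            return self.journal_cache[block_id]
        return self.main[block_id]

store = JournalledBlockStore()
store.main[1] = b"old index block"
store.write_committed_block(1, b"replacement block")
print(store.read_block(1))  # the journalled replacement, not the old block
```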
The Node changes are written ahead to the node storage, which is
append-only, so they don't need recording in the journal. They are
inaccessible to earlier transactions because they are unreferenced by the
node table indexes.
The journal needs to be written to the main database. TDB1 is
update-in-place. TDB1 is also lock-free. Writing to the main index
requires that there are no other transactions using the database. If
there are other active transactions, the work is not done but queued.
This queue is checked whenever a transaction, read or write, finishes.
If at that point the transaction is the only one active, TDB1 writes
the journal to the main database, and clears the journal. That
transaction can be a reader - the work of write-back is then incurred by
the reader.
This is the delayed replay queue. (Replay because it's a write-ahead
logging system, and writing back the journal is replaying the changes.)
Write transaction changes are always delayed, for efficiency: it
amortizes the overhead of write-back.
There can be layers: writers running with more changes while earlier
changes to the database are still in the delayed replay queue, and those
may be in use by readers. A new layer is added for each new writer.
Under load, the delayed replay queue grows: there is never a quiet
moment in which to write the changes back to the main database.
There are a couple of mechanisms to catch this: if the queue is over a
certain length, or the total size of the journal is over a threshold,
TDB1 holds back transactions as they begin, waits for the current ones
to finish, then writes back the queue.
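A minimal sketch of the delayed replay queue and its back-pressure, with assumed names and an arbitrary queue limit (TDB1's real thresholds also consider journal size, and it waits for transactions to finish rather than failing):

```python
# Sketch: committed journals queue up while other transactions are active;
# when the last active transaction finishes, the queue is replayed into the
# main database. A queue-length threshold provides back-pressure.

QUEUE_LIMIT = 8  # assumed threshold for illustration

class TransactionCoordinator:
    def __init__(self):
        self.active = 0
        self.replay_queue = []  # committed journals awaiting write-back
        self.database = {}

    def begin(self):
        # Under pressure, hold back new transactions (here: refuse;
        # TDB1 blocks them until write-back has happened).
        if len(self.replay_queue) >= QUEUE_LIMIT:
            raise RuntimeError("back-pressure: waiting for write-back")
        self.active += 1

    def commit_writer(self, journal):
        self.replay_queue.append(journal)
        self.finish()

    def finish(self):
        # Called when any transaction, reader or writer, ends.
        self.active -= 1
        if self.active == 0:
            for journal in self.replay_queue:  # replay changes in order
                self.database.update(journal)
            self.replay_queue.clear()

coord = TransactionCoordinator()
coord.begin()                        # a reader starts
coord.begin()                        # a writer starts
coord.commit_writer({"blk": b"new"}) # writer commits; reader still active,
                                     # so write-back is deferred
coord.finish()                       # the reader, last one active, triggers
                                     # the replay into the main database
```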
** TDB2
In TDB2, data structures are "append-only" in that, once written and
committed, they are never changed. New data is written to new blocks,
and either the root of the tree changes (in the case of the B+Trees -
copy-on-write, also called "persistent data structures", where
'persistent' is not related to external storage; a different branch of
computer science uses the same word with a different meaning) or the
visible length of the file changes (append-only .dat files).
The only use of the journal is to transactionally manage small control
data such as the block id of the new tree root. A transaction is less
than a disk block.
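The copy-on-write idea can be sketched with an ordinary binary search tree standing in for the B+Trees (illustration only, not TDB2's code): an update copies just the nodes on the path from the root, old roots remain valid for readers, and committing reduces to durably recording the new root's identity - a few bytes of control data.

```python
# Sketch of a copy-on-write ("persistent") tree: insert never mutates
# existing nodes, it builds fresh copies along the search path and
# returns a new root. Readers holding the old root are unaffected.

class Node:
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root, key, value):
    if root is None:
        return Node(key, value)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    if key > root.key:
        return Node(root.key, root.value, root.left, insert(root.right, key, value))
    return Node(key, value, root.left, root.right)  # replace the value

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

old_root = insert(insert(None, "s", 1), "p", 2)
new_root = insert(old_root, "s", 99)  # the writer's view after an update
# A reader still holding old_root is unaffected by the writer's change:
print(lookup(old_root, "s"), lookup(new_root, "s"))  # 1 99
```

Committing a transaction then only needs to record which root is current, which is why the journal entry is smaller than a disk block.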
Compared to TDB1, TDB2:
+ Writers change the database as they proceed.
Write efficiency: changes go directly to the database, so there is only
one write, not two (once to the journal, once to the database), and they
get write-buffered by the operating system with all the usual efficiency
the OS can provide in disk scheduling.
This improves bulk loading to the point where tdb2.tdbloader isn't doing
low-level file manipulation; it is a simple write to the database. If
low-level manipulation is an improvement, it can fit in there.
No variable-size heap cache: large inserts and deletes go to the live
database and can be any size. There is no caching of an old-style
journal that grows with the size of the changes. No more running out of
heap with a large transaction.
+ Readers only read
A read transaction does not need to do anything about the delayed replay
queue. Readers just read the database, never write.
Predictable read performance.
Of course, there is a downside.
The database grows faster and needs compaction. People will start asking
why the database is so large: they already ask about TDB1, and TDB2
databases will be bigger.
Maintaining compact databases while the system runs has costs, depending
on how it is done: e.g. it's slower, with some kind of incremental
maintenance overhead (disk/SSD I/O); transaction performance is less
predictable; (very) complicated locking schemes, including system aborts
when the DB detects a deadlock (and bugs, because it's complicated);
large writes impact concurrent readers much more.
TDB1 and TDB2 don't system-abort due to deadlock.
Other: TDB2 transaction coordinator is general, not TDB2 specific so it
will be able to include text indexes in the future.
** TDB3
An experiment, not part of Jena. Currently, it's working and not bad.
Bulk loads are slower at the 100M-triple scale, but the promise is that
large loads (billion-triple range) are better. As an experiment, it may
not be a good idea - and will make slow progress. There are no releases
and none planned.
TDB3 uses RocksDB -- http://rocksdb.org/.
That means using SSTables, not CoW B+Trees. At the moment, there is a
single SSTable for everything, because the stored data can be
partitioned within it, so there is no need for several RocksDB
databases.
Still needs compaction. That's an innate feature of SSTable and LSM
(Log-Structured Merge) systems.
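To illustrate the general LSM idea (not RocksDB's actual algorithm): each flush produces an immutable sorted run, reads may have to consult several runs, and compaction merges them, keeping the newest value per key and dropping deletions.

```python
# Sketch: merge sorted runs (newest-first) into one compacted run.
# Newer values win; tombstones drop deleted keys from the result.

TOMBSTONE = object()

def compact(runs):
    """Merge runs, listed newest-first, into a single sorted run."""
    merged = {}
    for run in reversed(runs):  # apply oldest first; newer entries overwrite
        merged.update(run)
    return sorted((k, v) for k, v in merged.items() if v is not TOMBSTONE)

older = {"a": 1, "b": 2, "c": 3}
newer = {"b": 20, "c": TOMBSTONE}   # update b, delete c
print(compact([newer, older]))      # [('a', 1), ('b', 20)]
```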
It is also based on work (RocksDB PR#1298) by Adam Retter to expose the
RocksDB transaction system to Java.
https://github.com/facebook/rocksdb/wiki/A-Tutorial-of-RocksDB-SST-formats
Andy
Re: TDB details - write transactions.
Posted by Jean-Marc Vanel <je...@gmail.com>.
The answer is in the docs linked to Jena 3.5 announce:
http://jena.apache.org/documentation/tdb2/tdb2_admin.html#compaction
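For the command line, Jena ships a compaction tool; assuming the Jena scripts are on the PATH, the invocation looks like:

```shell
# Compact the TDB2 database in the DB2 directory.
tdb2.tdbcompact --loc=DB2
```

There is also a Java API entry point (DatabaseMgr.compact) for doing the same from code; see the admin documentation linked above for details.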
2017-11-03 13:08 GMT+01:00 zPlus <zp...@peers.community>:
> Does Jena have a way to compact TDB2 databases, maybe with some CLI
> tool to run manually? Or do TDB2 databases just grow indefinitely?
--
Jean-Marc Vanel
http://www.semantic-forms.cc:9111/display?displayuri=http://jmvanel.free.fr/jmv.rdf%23me#subject
Déductions SARL - Consulting, services, training,
Rule-based programming, Semantic Web
+33 (0)6 89 16 29 52
Twitter: @jmvanel , @jmvanel_fr ; chat: irc://irc.freenode.net#eulergui
Re: TDB details - write transactions.
Posted by zPlus <zp...@peers.community>.
Does Jena have a way to compact TDB2 databases, maybe with some CLI
tool to run manually? Or do TDB2 databases just grow indefinitely?