Posted to commits@jena.apache.org by an...@apache.org on 2022/02/20 15:08:47 UTC

[jena-site] branch main updated: Update documentation about TDB1 and TDB2

This is an automated email from the ASF dual-hosted git repository.

andy pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/jena-site.git


The following commit(s) were added to refs/heads/main by this push:
     new 16bc5fe  Update documentation about TDB1 and TDB2
16bc5fe is described below

commit 16bc5feeda9e30004b5142ce50af431388533bba
Author: Andy Seaborne <an...@apache.org>
AuthorDate: Thu Jan 6 14:44:32 2022 +0000

    Update documentation about TDB1 and TDB2
---
 source/documentation/tdb/architecture.md     | 101 +++++++++++++++++++--------
 source/documentation/tdb/tdb_transactions.md |  78 ++++++++-------------
 2 files changed, 104 insertions(+), 75 deletions(-)

diff --git a/source/documentation/tdb/architecture.md b/source/documentation/tdb/architecture.md
index 930e12a..49d1dc5 100644
--- a/source/documentation/tdb/architecture.md
+++ b/source/documentation/tdb/architecture.md
@@ -2,8 +2,8 @@
 title: TDB Architecture
 ---
 
-This page gives an overview of the TDB architecture. Specific
-details refer to TDB 0.8.
+This page gives an overview of the TDB architecture.
+It applies to TDB1 and TDB2 with differences noted.
 
 ## Contents
 
@@ -13,6 +13,7 @@ details refer to TDB 0.8.
     -   [Triple and Quad indexes](#triple-and-quad-indexes)
     -   [Prefixes Table](#prefixes-table)
     -   [TDB B+Trees](#tdb-btrees)
+    -   [Transactions](#tdb-transactions)
 -   [Inline values](#inline-values)
 -   [Query Processing](#query-processing)
 -   [Caching on 32 and 64 bit Java systems](#caching-on-32-and-64-bit-java-systems)
@@ -39,8 +40,7 @@ filing system. A dataset consists of
 
 The node table stores the representation of RDF terms (except for
 inlined value - see below). It provides two mappings from Node to
-NodeId and from NodeId to Node. This is sometimes called a
-dictionary.
+NodeId and from NodeId to Node.
 
 The Node to NodeId mapping is used during data loading and when
 converting constant terms in queries from their Jena Node
@@ -88,23 +88,72 @@ or
 ### TDB B+Trees
 
 Many of the persistent data structures in TDB use a custom
-implementation of threaded
+implementation of 
 [B+Trees](http://en.wikipedia.org/wiki/B+_tree "http://en.wikipedia.org/wiki/B%2B_tree").
 The TDB implementation only provides for fixed length key and fixed
-length value. There is no use of the value part in triple indexes.
+length value. There is no use of the value part in triple and quad indexes.
 
-The threaded nature means that long scans of indexes proceeds
-without needing to traverse the branches of the tree.
+### Transactions {#tdb-transactions}
 
-See the description of index caching below.
+Both TDB1 and TDB2 provide database transactions.
+The API is described on the [Jena Transactions page](/documentation/txn/ "Jena Transactions").
+
+When running with transactions, TDB1 and TDB2 provide support for multiple read
+and write transactions without application involvement. There can be multiple
+readers active at the same time as a single active writer (referred to as "MR+SW").
+TDB itself manages multiple writers, queuing them as necessary.
+
+To support transactions, TDB2 uses copy-on-write MVCC data structures internally.
+
+TDB1 can run non-transactionally but the application is responsible for ensuring
+that there is one writer or several readers, not both. This is referred to as
+"MRSW". Misuse of TDB1 in non-transactional mode can corrupt the database.
 
 ## Inline values
 
-Values of certain datatypes are held as part of the NodeId in the
-bottom 56 bits. The top 8 bits indicates the type - external NodeId
-or the value space.
+Values of certain datatypes are held as part of the NodeId.
+The top bit indicates whether the remaining 63 bits are a position in the stored
+RDF terms file (high bit is 0) or an encoded value (high bit 1).
+
+By storing the value, the exact lexical form is not recorded. The
+integers 01 and 1 will both be treated as the value 1.
+
+### TDB2
+
+The TDB2 encoding is as follows:
+
+* High bit (bit 63) 0 means the node is in the object table (PTR).
+* High bit (bit 63) 1, bit 62 1: double as 62 bits.
+* High bit (bit 63) 1, bit 62 0: 6 bits of type, 56 bits of value.
+ 
+If a value would not fit, it will be stored externally so there is no
+guarantee that all integers, say, are stored inline.
+ 
+* Integer format: signed 56 bit number, the type field has the XSD type.
+* Derived types of integer, each with their own datatype.
+* Decimal format: 8 bits of scale, 48 bits of signed value.
+* Date and DateTime
+* Boolean
+* Float
+
+In the case of xsd:double, the standard Java 64 bit format is used except that the range
+of the exponent is reduced by 2 bits.
+
+* bit  63    : sign bit
+* bits 52-62 : exponent, 11 bits, the power of 2, bias -1023.
+* bits 0-51  : mantissa (significand) 52 bits (the leading one is not stored).
 
-The value spaces handled are (TDB 0.8): xsd:decimal, xsd:integer,
+Exponents are 11 bits, with values -1022 to +1023 held as 1 to 2046 (11 bits, bias -1023).
+Exponents 0x000 and 0x7ff have a special meaning.
+
+The xsd:dateTime and xsd:date ranges cover about 8000 years from
+year zero with a precision down to 1 millisecond. Timezone
+information is retained to an accuracy of 15 minutes with special
+timezones for Z and for no explicit timezone.
+
+### TDB1
+
+The value spaces handled are: xsd:decimal, xsd:integer,
 xsd:dateTime, xsd:date and xsd:boolean. Each has its own encoding
 to fit in 56 bits. If a node falls outside of the range of values
 that can be represented in the 56 bit encoding, it is stored externally instead.
@@ -114,31 +163,27 @@ year zero with a precision down to 1 millisecond. Timezone
 information is retained to an accuracy of 15 minutes with special
 timezones for Z and for no explicit timezone.
 
-By storing the value, the exact lexical form is not recorded. The
-integers 01 and 1 will both be treated as the value 1.
-
 Derived XSD datatypes are held as their base type. The exact
 datatype is not retained; the value of the RDF term is.
+An input of `xsd:int` will become `xsd:integer`.
 
 ## Query Processing
 
-TDB uses the
-[OpExecutor extension point of ARQ](TODO).
+TDB uses quad-based execution, rewriting SPARQL algebra `(graph...)` to blocks of quads
+where possible. It extends ARQ's `OpExecutor`.
 TDB provides low level optimization of basic graph patterns using a
-[statistics based optimizer](optimizer.html "TDB/Optimizer").
+[statistics based optimizer](optimizer.html).
 
 ## Caching on 32 and 64 bit Java systems
 
-TDB runs on both 32-bit and 64-bit Java Virtual Machines. The same
-file formats are used on both systems and database files can be
-transferred between architectures (no TDB system should be running
-for the database at the time of copy). What differs is the file
-access mechanism used.
-
-TDB is faster on a 64 bit JVM because more memory is available for
-file caching.
+TDB runs on both 32-bit and 64-bit Java Virtual Machines.  A 64-bit Java Virtual
+Machine is the normal mode of use.  The same file formats are used on both
+systems and database files can be transferred between architectures (no TDB
+system should be running for the database at the time of copy). What differs is
+the file access mechanism used.
 
-The node table caches are always in the Java heap.
+The node table caches are always in the Java heap but otherwise the OS file
+system plays an important part in index caching.
 
 The file access mechanism can be set explicitly, but this is not a
 good idea for production usage, only for experimentation - see the
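
The TDB2 NodeId layout listed in the architecture.md hunk above (bit 63 selects pointer or inline value, bit 62 selects double or typed value, then 6 bits of type and 56 bits of value) can be sketched with plain bit operations. This is an illustration only, not the TDB2 source: the class name, method names and the numeric type tag used below are hypothetical, since the commit does not give the actual tag values.

```java
// Illustrative sketch of the 64-bit NodeId layout described above.
// All names and the type tag value are hypothetical, not taken from TDB2.
final class NodeIdLayoutSketch {
    enum Kind { PTR, DOUBLE, TYPED_VALUE }

    // Bit 63 clear: pointer into the node table.
    // Bit 63 set, bit 62 set: inline double.
    // Bit 63 set, bit 62 clear: 6 bits of type followed by 56 bits of value.
    static Kind kindOf(long id) {
        if (id >= 0) return Kind.PTR;                       // sign bit (bit 63) is 0
        if ((id & (1L << 62)) != 0) return Kind.DOUBLE;
        return Kind.TYPED_VALUE;
    }

    // Pack a 6-bit type tag and a signed 56-bit value into one long.
    static long encodeTyped(int typeTag, long value) {
        long min = -(1L << 55);
        long max = (1L << 55) - 1;
        if (value < min || value > max) {
            // Per the text above, such a value would be stored externally instead.
            throw new IllegalArgumentException("value does not fit in 56 bits");
        }
        return (1L << 63) | ((long) (typeTag & 0x3F) << 56) | (value & ((1L << 56) - 1));
    }

    // Bits 56-61 hold the type tag.
    static int typeTag(long id) {
        return (int) ((id >>> 56) & 0x3F);
    }

    // Shift the top 8 bits out, then shift back arithmetically to sign-extend bit 55.
    static long signedValue56(long id) {
        return (id << 8) >> 8;
    }

    public static void main(String[] args) {
        long id = encodeTyped(3, -5);   // 3 is a made-up tag for illustration
        System.out.println(kindOf(id) + " tag=" + typeTag(id) + " value=" + signedValue56(id));
    }
}
```

The range check mirrors the statement that a value which would not fit is stored externally, so not every integer ends up inline.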
diff --git a/source/documentation/tdb/tdb_transactions.md b/source/documentation/tdb/tdb_transactions.md
index ba063ff..bf716b3 100644
--- a/source/documentation/tdb/tdb_transactions.md
+++ b/source/documentation/tdb/tdb_transactions.md
@@ -5,13 +5,13 @@ title: TDB Transactions
 TDB provides
 [ACID](http://en.wikipedia.org/wiki/ACID)
 transaction support through the use of
-[write-ahead-logging](http://en.wikipedia.org/wiki/Write-ahead_logging).
+[write-ahead-logging](http://en.wikipedia.org/wiki/Write-ahead_logging) in TDB1
+and copy-on-write MVCC structures in TDB2.
 
-Use of transactions protects a TDB dataset
-against data corruption, unexpected process termination and system crashes and therefore use of transactions is **strongly** recommended.
+Use of transactions protects a TDB dataset against data corruption, unexpected
+process termination and system crashes. 
 
-This feature is part of version TDB 0.9.0 and later.  Databases created with version of TDB 0.8.X can be used with 0.9.X
-to add transactional capability.
+Non-transactional use of TDB1 should be avoided; TDB2 only operates with transactions.
 
 ## Contents
 
@@ -23,18 +23,19 @@ to add transactional capability.
 -   [Multi-threaded use](#multi-threaded-use)
 -   [Bulk loading](#bulk-loading)
 -   [Multi JVM](#multi-jvm)
--   [Migration from TDB 0.8.X](#migration-from-tdb-08x)
--   [Reverting to TDB 0.8.X](#reverting-to-tdb-08x)
 
 ## Overview
 
-The transaction mechanism in TDB is based on
-[write-ahead-logging](http://en.wikipedia.org/wiki/Write-ahead_logging).
-All changes made inside a write-transaction are written to
-[journals](http://en.wikipedia.org/wiki/Journaling_file_system),
-then propagated to the main database at a suitable moment. This
-design allows for read-transactions to proceed without locking or
-other overhead over the base database.
+TDB2 uses [MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)
+via a copy-on-write mechanism. Update transactions can be of any size.
+
+The TDB1 transaction mechanism is based on
+[write-ahead-logging](http://en.wikipedia.org/wiki/Write-ahead_logging).  All
+changes made inside a write-transaction are written to
+[journals](http://en.wikipedia.org/wiki/Journaling_file_system), then propagated
+to the main database at a suitable moment.  Transactions in TDB1 are limited in
+size to a few tens of millions of triples because they retain data in-memory until
+indexes can be updated.
 
 Transactional TDB supports one active write transaction, and
 multiple read transactions at the same time. Read-transactions
@@ -59,23 +60,26 @@ transactions, the highest
 (some of these limitations may be removed in later versions)
 
 -   Bulk loads: the TDB bulk loader is not transactional
--   [Nested transactions](http://en.wikipedia.org/wiki/Nested_transaction) are not supported.
+-   [Nested transactions](http://en.wikipedia.org/wiki/Nested_transaction) 
+    are not supported.
+
+TDB2 removed the following limitations of TDB1:
+
 -   Some active transaction state is held exclusively in-memory,
     limiting scalability.
 -   Long-running transactions. Read-transactions cause a build-up
     of pending changes;
 
 If a single read transaction runs for a long time when there are
-many updates, the system will consume a lot of temporary
+many updates, the TDB1 system will consume a lot of temporary
 resources.
 
 ## API for Transactions
 
-TDB supports the general Jena API for transactions on RDF datasets 
-(introduced in Jena 2.7.0, ARQ 2.9.0).
+This section uses the primitives of the transaction mechanism. 
 
-A TDB-backed dataset can be used non-transactionally but once used in a transaction, 
-it must be used transactionally after that.
+Better APIs are described in [the transaction API
+documentation](/documentation/txn/).
 
 ### Read transactions
 
@@ -218,35 +222,15 @@ same storage. in both cases, the transactions are independent.
 Multiple applications, running in multiple JVMs, using the same
 file databases is not supported and has a high risk of data corruption.  Once corrupted a database cannot be repaired
 and must be rebuilt from the original source data. Therefore there **must** be a single JVM
-controlling the database directory and files.  From 1.1.0 onwards TDB includes automatic prevention against multi-JVM
+controlling the database directory and files. TDB includes automatic prevention against multi-JVM use
 which prevents this under most circumstances.
 
-Use our [Fuseki](../fuseki2/) component to provide a
-database server for multiple applications. Fuseki supports 
-[SPARQL Query](http://www.w3.org/TR/sparql11-query/),
-[SPARQL Update](http://www.w3.org/TR/sparql11-update/) and the
-[SPARQL Graph Store protocol](http://www.w3.org/TR/sparql11-http-rdf-update/).
+Use [Fuseki](../fuseki2/) to provide a database server for multiple
+applications. Fuseki supports [SPARQL
+Query](http://www.w3.org/TR/sparql11-query/), [SPARQL
+Update](http://www.w3.org/TR/sparql11-update/) and the [SPARQL Graph Store
+protocol](http://www.w3.org/TR/sparql11-http-rdf-update/).
 
 ## Bulk loading
 
-The bulk loader is not transactional.
-
-## Migration from TDB 0.8.X
-
-The database files used by TDB 0.9.0 are fully compatible with TDB
-0.8.X; there are no file format changes and application code using
-the interface provided by `TDBFactory` will continue to work as
-before, without transaction capabilities. The only addition is the
-presence of journal files.
-
-Transactions use a new API: the `TDBFactory` API is still present.
-If an application simply uses the TDB 0.9 codebase, it will work as
-before without transactions.
-
-Applications can start using transaction by coding to the new API.
-
-## Reverting to TDB 0.8.X
-
-A database can be reverted to TDB 0.8.X by running `tdb.tdbrecover`
-- this program recovers any committed transaction with pending
-actions. The database can then be used with TDB 0.8.X.
+Bulk loaders are not transactional.
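
The "MR+SW" behaviour and the copy-on-write idea mentioned in the updated pages can be illustrated with a toy in-memory store. This is a minimal sketch under stated assumptions, not how TDB2 is implemented: it uses only the Java standard library, keeps triples as plain strings, and every name in it (`CowStoreSketch`, `WriteTxn`, `beginRead`, `beginWrite`) is hypothetical. Readers get an immutable snapshot and never block; writers are queued on a single permit and publish a new version on commit.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.Semaphore;
import java.util.concurrent.atomic.AtomicReference;

// Toy copy-on-write store: many readers plus one writer at a time (MR+SW).
// Hypothetical illustration only; not the TDB2 data structures.
final class CowStoreSketch {
    // The currently committed, immutable version of the data.
    private final AtomicReference<Set<String>> current =
            new AtomicReference<>(Collections.emptySet());
    // A single permit queues writers so only one is active at a time.
    private final Semaphore writerPermit = new Semaphore(1);

    // A read "transaction" is just a stable snapshot of the committed version.
    Set<String> beginRead() {
        return current.get();
    }

    // Blocks until no other writer is active, then copies the committed version.
    WriteTxn beginWrite() {
        writerPermit.acquireUninterruptibly();
        return new WriteTxn(current.get());
    }

    final class WriteTxn {
        private final Set<String> working;
        private boolean finished = false;

        private WriteTxn(Set<String> base) {
            this.working = new HashSet<>(base);   // private copy; readers never see it
        }

        void add(String triple) {
            requireActive();
            working.add(triple);
        }

        // Publish the private copy as the new committed version.
        void commit() {
            requireActive();
            current.set(Collections.unmodifiableSet(working));
            finish();
        }

        // Discard the private copy; the committed version is untouched.
        void abort() {
            requireActive();
            finish();
        }

        private void requireActive() {
            if (finished) throw new IllegalStateException("transaction already finished");
        }

        private void finish() {
            finished = true;
            writerPermit.release();
        }
    }

    public static void main(String[] args) {
        CowStoreSketch store = new CowStoreSketch();
        Set<String> before = store.beginRead();
        WriteTxn txn = store.beginWrite();
        txn.add("<s> <p> <o>");
        txn.commit();
        System.out.println(before.size() + " -> " + store.beginRead().size());
    }
}
```

A reader that took its snapshot before the commit keeps seeing the old version, which is the property that lets read transactions proceed while a writer is active.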