Posted to commits@jena.apache.org by an...@apache.org on 2022/01/06 14:59:33 UTC

[jena-site] branch tdb-arch updated (67ba74d -> 2559fd8)

This is an automated email from the ASF dual-hosted git repository.

andy pushed a change to branch tdb-arch
in repository https://gitbox.apache.org/repos/asf/jena-site.git.


 discard 67ba74d  Update documentation about TDB1 and TDB2
     new 2559fd8  Update documentation about TDB1 and TDB2

This update added new revisions after undoing existing revisions.
That is to say, some revisions that were in the old version of the
branch are not in the new version.  This situation occurs
when a user --force pushes a change and generates a repository
containing something like this:

 * -- * -- B -- O -- O -- O   (67ba74d)
            \
             N -- N -- N   refs/heads/tdb-arch (2559fd8)

You should already have received notification emails for all of the O
revisions, and so the following emails describe only the N revisions
from the common base, B.

Any revisions marked "omit" are not gone; other references still
refer to them.  Any revisions marked "discard" are gone forever.

The 1 revisions listed above as "new" are entirely new to this
repository and will be described in separate emails.  The revisions
listed as "add" were already present in the repository and have only
been added to this reference.


Summary of changes:
 source/documentation/tdb/architecture.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

[jena-site] 01/01: Update documentation about TDB1 and TDB2

Posted by an...@apache.org.
This is an automated email from the ASF dual-hosted git repository.

andy pushed a commit to branch tdb-arch
in repository https://gitbox.apache.org/repos/asf/jena-site.git

commit 2559fd881859fcd06452d60c752992dfdb81c0de
Author: Andy Seaborne <an...@apache.org>
AuthorDate: Thu Jan 6 14:44:32 2022 +0000

    Update documentation about TDB1 and TDB2
---
 source/documentation/tdb/architecture.md     | 101 +++++++++++++++++++--------
 source/documentation/tdb/tdb_transactions.md |  78 ++++++++-------------
 2 files changed, 104 insertions(+), 75 deletions(-)

diff --git a/source/documentation/tdb/architecture.md b/source/documentation/tdb/architecture.md
index 930e12a..bd2a17d 100644
--- a/source/documentation/tdb/architecture.md
+++ b/source/documentation/tdb/architecture.md
@@ -2,8 +2,8 @@
 title: TDB Architecture
 ---
 
-This page gives an overview of the TDB architecture. Specific
-details refer to TDB 0.8.
+This page gives an overview of the TDB architecture.
+It applies to TDB1 and TDB2 with differences noted.
 
 ## Contents
 
@@ -13,6 +13,7 @@ details refer to TDB 0.8.
     -   [Triple and Quad indexes](#triple-and-quad-indexes)
     -   [Prefixes Table](#prefixes-table)
     -   [TDB B+Trees](#tdb-btrees)
+    -   [Transactions](#tdb-transactions)
 -   [Inline values](#inline-values)
 -   [Query Processing](#query-processing)
 -   [Caching on 32 and 64 bit Java systems](#caching-on-32-and-64-bit-java-systems)
@@ -39,8 +40,7 @@ filing system. A dataset consists of
 
 The node table stores the representation of RDF terms (except for
 inlined value - see below). It provides two mappings from Node to
-NodeId and from NodeId to Node. This is sometimes called a
-dictionary.
+NodeId and from NodeId to Node.
 
 The Node to NodeId mapping is used during data loading and when
 converting constant terms in queries from their Jena Node
@@ -88,23 +88,72 @@ or
 ### TDB B+Trees
 
 Many of the persistent data structures in TDB use a custom
-implementation of threaded
+implementation of 
 [B+Trees](http://en.wikipedia.org/wiki/B+_tree "http://en.wikipedia.org/wiki/B%2B_tree").
 The TDB implementation only provides for fixed length key and fixed
-length value. There is no use of the value part in triple indexes.
+length value. There is no use of the value part in triple and quad indexes.
 
-The threaded nature means that long scans of indexes proceeds
-without needing to traverse the branches of the tree.
+### Transactions {#tdb-transactions}
 
-See the description of index caching below.
+Both TDB1 and TDB2 provide database transactions.
+The API is described on the [Jena Transactions page](/documentation/txn/ "Jena Transactions").
+
+When running with transactions, TDB1 and TDB2 provide support for multiple read
+and write transactions without application involvement. There will be multiple
+readers active, and also a single writer active (referred to as "MR+SW"). TDB
+itself manages multiple writers, queuing them as necessary.
+
+To support transactions, TDB2 uses copy-on-write MVCC data structures internally.
+
+TDB1 can run non-transactionally but the application is responsible for ensuring
+that there is one writer or several readers, not both. This is referred to as
+"MRSW". Misuse of TDB1 in this mode can corrupt the database.
 
 ## Inline values
 
-Values of certain datatypes are held as part of the NodeId in the
-bottom 56 bits. The top 8 bits indicates the type - external NodeId
-or the value space.
+Values of certain datatypes are held as part of the NodeId.
+The top bit indicates whether the remaining 63 bits are a position in the stored
+RDF terms file (high bit is 0) or an encoded value (high bit 1).
+
+By storing the value, the exact lexical form is not recorded. The
+integers 01 and 1 will both be treated as the value 1.
+
+### TDB2
+
+The TDB2 encoding is as follows:
+
+* High bit (bit 63) 0 means the node is in the object table (PTR).
+* High bit (bit 63) 1, bit 62 1: double as 62 bits.
+* High bit (bit 63) 1, bit 62 0: 6 bits of type, 56 bits of value.
+ 
+If a value would not fit, it will be stored externally so there is no
+guarantee that all integers, say, are stored inline.
+ 
+* Integer format: signed 56 bit number, the type field has the XSD type.
+* Derived types of integer, each with their own datatype.
+* Decimal format: 8 bits of scale, 48 bits of signed value.
+* Date and DateTime
+* Boolean
+* Float
+
+In the case of xsd:double, the standard Java 64 bit format is used except that the range
+of the exponent is reduced by 2 bits.
+
+* bit  63    : sign bit
+* bits 52-62 : exponent, 11 bits, the power of 2, bias -1023.
+* bits 0-51  : mantissa (significand) 52 bits (the leading one is not stored).
 
-The value spaces handled are (TDB 0.8): xsd:decimal, xsd:integer,
+Exponents are 11 bits, with values -1022 to +1023 held as 1 to 2046 (11 bits, bias -1023).
+Exponents 0x000 and 0x7ff have a special meaning:
+
+The xsd:dateTime and xsd:date ranges cover about 8000 years from
+year zero with a precision down to 1 millisecond. Timezone
+information is retained to an accuracy of 15 minutes with special
+timezones for Z and for no explicit timezone.
+
+### TDB1
+
+The value spaces handled are: xsd:decimal, xsd:integer,
 xsd:dateTime, xsd:date and xsd:boolean. Each has its own encoding
 to fit in 56 bits. If a node falls outside of the range of values
 that can be represented in the 56 bit encoding.
@@ -114,31 +163,27 @@ year zero with a precision down to 1 millisecond. Timezone
 information is retained to an accuracy of 15 minutes with special
 timezones for Z and for no explicit timezone.
 
-By storing the value, the exact lexical form is not recorded. The
-integers 01 and 1 will both be treated as the value 1.
-
 Derived XSD datatypes are held as their base type. The exact
 datatype is not retained; the value of the RDF term is.
+An input of `xsd:int` will become `xsd:integer`.
 
 ## Query Processing
 
-TDB uses the
-[OpExecutor extension point of ARQ](TODO).
+TDB uses quad-execution rewriting SPARQL algebra `(graph...)` to blocks of quads
+where possible. It extends `OpExecutor`.
 TDB provides low level optimization of basic graph patterns using a
-[statistics based optimizer](optimizer.html "TDB/Optimizer").
+[statistics based optimizer](optimizer.html).
 
 ## Caching on 32 and 64 bit Java systems
 
-TDB runs on both 32-bit and 64-bit Java Virtual Machines. The same
-file formats are used on both systems and database files can be
-transferred between architectures (no TDB system should be running
-for the database at the time of copy). What differs is the file
-access mechanism used.
-
-TDB is faster on a 64 bit JVM because more memory is available for
-file caching.
+TDB runs on both 32-bit and 64-bit Java Virtual Machines.  A 64-bit Java Virtual
+Machine is the normal mode of use.  The same file formats are used on both
+systems and database files can be transferred between architectures (no TDB
+system should be running for the database at the time of copy). What differs is
+the file access mechanism used.
 
-The node table caches are always in the Java heap.
+The node table caches are always in the Java heap but otherwise the OS file
+system plays an important part in index caching.
 
 The file access mechanism can be set explicitly, but this is not a
 good idea for production usage, only for experimentation - see the
diff --git a/source/documentation/tdb/tdb_transactions.md b/source/documentation/tdb/tdb_transactions.md
index ba063ff..bf716b3 100644
--- a/source/documentation/tdb/tdb_transactions.md
+++ b/source/documentation/tdb/tdb_transactions.md
@@ -5,13 +5,13 @@ title: TDB Transactions
 TDB provides
 [ACID](http://en.wikipedia.org/wiki/ACID)
 transaction support through the use of
-[write-ahead-logging](http://en.wikipedia.org/wiki/Write-ahead_logging).
+[write-ahead-logging](http://en.wikipedia.org/wiki/Write-ahead_logging) in TDB1
+and copy-on-write MVCC structures in TDB2.
 
-Use of transactions protects a TDB dataset
-against data corruption, unexpected process termination and system crashes and therefore use of transactions is **strongly** recommended.
+Use of transactions protects a TDB dataset against data corruption, unexpected
+process termination and system crashes. 
 
-This feature is part of version TDB 0.9.0 and later.  Databases created with version of TDB 0.8.X can be used with 0.9.X
-to add transactional capability.
+Non-transactional use of TDB1 should be avoided; TDB2 only operates with transactions.
 
 ## Contents
 
@@ -23,18 +23,19 @@ to add transactional capability.
 -   [Multi-threaded use](#multi-threaded-use)
 -   [Bulk loading](#bulk-loading)
 -   [Multi JVM](#multi-jvm)
--   [Migration from TDB 0.8.X](#migration-from-tdb-08x)
--   [Reverting to TDB 0.8.X](#reverting-to-tdb-08x)
 
 ## Overview
 
-The transaction mechanism in TDB is based on
-[write-ahead-logging](http://en.wikipedia.org/wiki/Write-ahead_logging).
-All changes made inside a write-transaction are written to
-[journals](http://en.wikipedia.org/wiki/Journaling_file_system),
-then propagated to the main database at a suitable moment. This
-design allows for read-transactions to proceed without locking or
-other overhead over the base database.
+TDB2 uses [MVCC](https://en.wikipedia.org/wiki/Multiversion_concurrency_control)
+via a copy-on-write mechanism. Update transactions can be of any size.
+
+The TDB1 transaction mechanism is based on
+[write-ahead-logging](http://en.wikipedia.org/wiki/Write-ahead_logging).  All
+changes made inside a write-transaction are written to
+[journals](http://en.wikipedia.org/wiki/Journaling_file_system), then propagated
+to the main database at a suitable moment.  Transactions in TDB1 are limited in
+size to a few tens of millions of triples because they retain data in-memory until
+indexes can be updated.
 
 Transactional TDB supports one active write transaction, and
 multiple read transactions at the same time. Read-transactions
@@ -59,23 +60,26 @@ transactions, the highest
 (some of these limitations may be removed in later versions)
 
 -   Bulk loads: the TDB bulk loader is not transactional
--   [Nested transactions](http://en.wikipedia.org/wiki/Nested_transaction) are not supported.
+-   [Nested transactions](http://en.wikipedia.org/wiki/Nested_transaction) 
+    are not supported.
+
+TDB2 removed the limitations of TDB1:
+
 -   Some active transaction state is held exclusively in-memory,
     limiting scalability.
 -   Long-running transactions. Read-transactions cause a build-up
     of pending changes;
 
 If a single read transaction runs for a long time when there are
-many updates, the system will consume a lot of temporary
+many updates, the TDB1 system will consume a lot of temporary
 resources.
 
 ## API for Transactions
 
-TDB supports the general Jena API for transactions on RDF datasets 
-(introduced in Jena 2.7.0, ARQ 2.9.0).
+This section uses the primitives of the transaction mechanism.
 
-A TDB-backed dataset can be used non-transactionally but once used in a transaction, 
-it must be used transactionally after that.
+Better APIs are described in [the transaction API
+documentation](/documentation/txn/).
 
 ### Read transactions
 
@@ -218,35 +222,15 @@ same storage. in both cases, the transactions are independent.
 Multiple applications, running in multiple JVMs, using the same
 file databases is not supported and has a high risk of data corruption.  Once corrupted a database cannot be repaired
 and must be rebuilt from the original source data. Therefore there **must** be a single JVM
-controlling the database directory and files.  From 1.1.0 onwards TDB includes automatic prevention against multi-JVM
+controlling the database directory and files. TDB includes automatic prevention against multi-JVM
 which prevents this under most circumstances.
 
-Use our [Fuseki](../fuseki2/) component to provide a
-database server for multiple applications. Fuseki supports 
-[SPARQL Query](http://www.w3.org/TR/sparql11-query/),
-[SPARQL Update](http://www.w3.org/TR/sparql11-update/) and the
-[SPARQL Graph Store protocol](http://www.w3.org/TR/sparql11-http-rdf-update/).
+Use [Fuseki](../fuseki2/) to provide a database server for multiple
+applications. Fuseki supports [SPARQL
+Query](http://www.w3.org/TR/sparql11-query/), [SPARQL
+Update](http://www.w3.org/TR/sparql11-update/) and the [SPARQL Graph Store
+protocol](http://www.w3.org/TR/sparql11-http-rdf-update/).
 
 ## Bulk loading
 
-The bulk loader is not transactional.
-
-## Migration from TDB 0.8.X
-
-The database files used by TDB 0.9.0 are fully compatible with TDB
-0.8.X; there are no file format changes and application code using
-the interface provided by `TDBFactory` will continue to work as
-before, without transaction capabilities. The only addition is the
-presence of journal files.
-
-Transactions use a new API: the `TDBFactory` API is still present.
-If an application simply uses the TDB 0.9 codebase, it will work as
-before without transactions.
-
-Applications can start using transaction by coding to the new API.
-
-## Reverting to TDB 0.8.X
-
-A database can be reverted to TDB 0.8.X by running `tdb.tdbrecover`
-- this program recovers any committed transaction with pending
-actions. The database can then be used with TDB 0.8.X.
+Bulk loaders are not transactional.
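
The TDB2 NodeId layout described in the architecture.md change above (bit 63 selects pointer or inline value; bit 62 selects double or typed value; then 6 bits of type and 56 bits of value) can be sketched in Java as follows. This is a minimal illustration, not Jena's implementation: the class `NodeIdSketch`, its methods and the `Kind` enum are hypothetical names, the type codes are left uninterpreted, and holding the integer as two's complement in the low 56 bits is an assumption based on the "signed 56 bit number" wording.

```java
/** Illustrative sketch of the TDB2 NodeId bit layout; not part of Apache Jena. */
public final class NodeIdSketch {
    /** The three cases distinguished by the two high bits. */
    public enum Kind { PTR, DOUBLE, TYPED_VALUE }

    private static final long BIT63 = 1L << 63;
    private static final long BIT62 = 1L << 62;
    private static final long VALUE_MASK = (1L << 56) - 1;

    private NodeIdSketch() {}

    /** Classify a 64-bit id using bits 63 and 62. */
    public static Kind kind(long id) {
        if ((id & BIT63) == 0) {
            return Kind.PTR;                 // position in the stored RDF terms file
        }
        return (id & BIT62) != 0 ? Kind.DOUBLE : Kind.TYPED_VALUE;
    }

    /** The 6 type bits (bits 56-61); only meaningful for TYPED_VALUE ids. */
    public static int typeBits(long id) {
        return (int) ((id >>> 56) & 0x3F);
    }

    /** The low 56 bits, sign-extended to a Java long (assumed two's complement). */
    public static long signed56(long id) {
        return (id << 8) >> 8;               // arithmetic shift restores the sign
    }

    /** True if the value can be held in a signed 56-bit field. */
    public static boolean fitsIn56(long value) {
        return value >= -(1L << 55) && value < (1L << 55);
    }

    /** Pack a type code and a signed value into an inline id (hypothetical helper). */
    public static long packInteger(int type, long value) {
        if (type < 0 || type > 0x3F) {
            throw new IllegalArgumentException("type must fit in 6 bits");
        }
        if (!fitsIn56(value)) {
            // Per the text, such a value would be stored externally instead.
            throw new IllegalArgumentException("value does not fit in 56 bits");
        }
        return BIT63 | ((long) type << 56) | (value & VALUE_MASK);
    }
}
```

A value that fails `fitsIn56` corresponds to the case in the text where the term is stored externally in the node table rather than inline.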