Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2015/06/08 17:41:39 UTC

TDB2

Informational announcement: TDB2

TDB2 is a reworking of TDB based on updated implementations of 
transactions and transactional data structures for project Lizard (a 
clustered SPARQL store).

TDB2 has:

* Arbitrary scale write-once transactions
* New transaction system - can add other first class components.
   (e.g. text indexes, cache tables)
* Models work across transaction boundaries
* Cleaner, simpler, more maintainable

TDB2 databases are not compatible with TDB databases.  TDB2 uses a more 
efficient encoding for RDF terms.  [1]

As this is a database, the new indexing and transaction code needs time 
to settle and mature.  I'm using that tech in Lizard development.

	Andy

TDB2 code:
https://github.com/afs/mantis/tree/master/tdb2

Lizard slides:
http://www.slideshare.net/andyseaborne/201411-apache-coneu-lizard


[1] An upgrade path using TDB1-style encoding is possible; it is a 
one-way upgrade and not reversible [2].  TDB2 adds control files 
for the copy-on-write data structures that TDB1 does not understand.

[2] Actually, if the encoding is compatible, what will happen is that 
TDB1 will see the database at the time of the upgrade.  Welcome to 
copy-on-write immutable data structures.
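
The behaviour in [2] can be illustrated with a minimal sketch.  This is 
not TDB2 code: CowStore is a hypothetical toy store, assumed to have a 
single writer, in which every write copies the data and publishes a new 
root.  A reader that holds on to an old root keeps seeing the state as 
it was when it took that root.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

class CowSnapshotDemo {
    /** Hypothetical toy store: each version is an immutable list reachable from a root. */
    static final class CowStore {
        private final AtomicReference<List<String>> root =
            new AtomicReference<>(Collections.<String>emptyList());

        /** A reader pins the current root; later writes never modify it. */
        List<String> snapshot() {
            return root.get();
        }

        /** Single writer assumed: copy, modify the copy, publish it as the new root. */
        void add(String triple) {
            List<String> next = new ArrayList<>(root.get());
            next.add(triple);
            root.set(Collections.unmodifiableList(next));
        }
    }

    public static void main(String[] args) {
        CowStore store = new CowStore();
        store.add(":s :p 1");
        List<String> oldView = store.snapshot();      // what an older reader holds on to
        store.add(":s :p 2");                         // written after the old root was taken
        System.out.println(oldView.size());           // 1
        System.out.println(store.snapshot().size());  // 2
    }
}
```

The old list is never touched after publication, which is why the older 
view stays frozen rather than becoming corrupt.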

Re: TDB2

Posted by Andy Seaborne <an...@apache.org>.

On 09/06/15 16:23, ajs6f@virginia.edu wrote:
> Is there some "high level" overview of Lizard/Mantis/TDB2 yet extant? Like the kind of thing we might see at a conference?
>

http://www.slideshare.net/andyseaborne/201411-apache-coneu-lizard

and the code on github (currently in my account).

> In any event, thanks for working on this-- it's great to know that Jena will be able to cluster soon.

I was recently looking at bulk loading.  TDB2 loads at 65 kTPS (thousand 
triples per second; 3 indexes) but on the same machine, Lizard, running 
all server nodes in the same JVM and still using Thrift/TCP networking to 
connect, is loading at 100 kTPS (2 indexes) and 95 kTPS (3 indexes).

The difference is parallelism - Lizard loads the indexes in bulk units 
with only the parser and node table on the main thread.  The bulk 
transfers and the Lizard vnodes are all separate threads.  Some or all 
of that approach applies to TDB2.  Whether it is better to make TDB2 = 
Lizard with in-JVM comms or still a separate project, I don't know.  If 
you look hard at the numbers, there are some inconsistencies; there is 
at least one unnecessary copy into the node table based on poor use of 
Apache Thrift.
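
A rough sketch of the threading pattern described above, not Lizard's 
implementation: the main thread stands in for the parser and hands 
batches ("bulk units") to one worker thread per index.  The class names, 
the string triples, the batch size and the in-JVM queues are all 
assumptions made for illustration; there is no node table and no 
Thrift/TCP here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class ParallelIndexLoadSketch {
    // Sentinel batch marking end of input; compared by identity, never by content.
    private static final List<String> END = new ArrayList<>();

    /** One worker per index: drains batches from its queue into a sorted set. */
    static final class IndexWorker extends Thread {
        final BlockingQueue<List<String>> queue = new ArrayBlockingQueue<>(8);
        final TreeSet<String> index = new TreeSet<>();  // read by others only after join()
        private final int[] order;                      // e.g. {0,1,2} = SPO, {1,2,0} = POS

        IndexWorker(String name, int... order) {
            super(name);
            this.order = order;
        }

        @Override
        public void run() {
            try {
                for (List<String> batch = queue.take(); batch != END; batch = queue.take()) {
                    for (String triple : batch) {
                        String[] t = triple.split(" ");
                        index.add(t[order[0]] + " " + t[order[1]] + " " + t[order[2]]);
                    }
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    /** Main thread: cut the input into batches and hand each batch to every index worker. */
    static List<IndexWorker> load(List<String> triples, int batchSize) {
        List<IndexWorker> workers = new ArrayList<>();
        workers.add(new IndexWorker("SPO", 0, 1, 2));
        workers.add(new IndexWorker("POS", 1, 2, 0));
        workers.add(new IndexWorker("OSP", 2, 0, 1));
        try {
            for (IndexWorker w : workers) w.start();
            for (int i = 0; i < triples.size(); i += batchSize) {
                int end = Math.min(i + batchSize, triples.size());
                List<String> batch = new ArrayList<>(triples.subList(i, end));
                for (IndexWorker w : workers) w.queue.put(batch);  // blocks if a worker falls behind
            }
            for (IndexWorker w : workers) w.queue.put(END);
            for (IndexWorker w : workers) w.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("load interrupted", e);
        }
        return workers;
    }

    public static void main(String[] args) {
        List<String> triples = new ArrayList<>();
        for (int i = 0; i < 1000; i++) triples.add("s" + i + " p o" + (i % 10));
        for (IndexWorker w : load(triples, 100)) {
            System.out.println(w.getName() + " entries: " + w.index.size());
        }
    }
}
```

The bounded queues give back-pressure: if one index falls behind, the 
main thread blocks instead of buffering without limit.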

	Andy

All figures are for loading 100 million BSBM triples, gzip compressed, 
into an empty database inside a write transaction.

The empty database is just for uniformity.  TDB2 does not have a 
separate bulk loader: (1) it has not been ported from TDB1 and (2) I am 
seeing whether one is needed.

Hardware: Quad core i7, 32G RAM, SSD.  The BSBM data is streamed from 
rotational disk; the raw parser speed is 315 kTPS.


Re: TDB2

Posted by Andy Seaborne <an...@apache.org>.

On 09/06/15 16:23, ajs6f@virginia.edu wrote:
> Is there some "high level" overview of Lizard/Mantis/TDB2 yet extant? Like the kind of thing we might see at a conference?
>

http://www.slideshare.net/andyseaborne/201411-apache-coneu-lizard

and the code on github (currently in my account).

> In any event, thanks for working on this-- it's great to know that Jena will be able to cluster soon.

I was recently looking at bulk loading.  TDB2 loads at 65 kTPS (thousand 
triples per second; 3 indexes) but on the same machine, Lizard, running 
all server nodes in the same JVM and still using Thrift/TCP networking to 
connect, is loading at 115 kTPS (no indexes), 100 kTPS (2 indexes) and 
95 kTPS (3 indexes).

The difference is parallelism - Lizard loads the indexes in bulk units 
with only the parser and node table on the main thread.  The bulk 
transfers and the service nodes are all separate threads.  Some or all 
of that approach applies to TDB2.  Whether it is better to make TDB2 = 
Lizard with in-JVM comms or still a separate project, I don't know.

	Andy

All figures are approximate and indicative only (only a few runs).
They are all loading an empty database with 100 million BSBM, gzip 
compressed, inside a write transaction.
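
To put the rates in wall-clock terms, here is a small worked 
calculation.  It assumes "100 million BSBM" means 100 million triples 
and that each rate is sustained for the whole load; the class and method 
names are made up for the example.  Under those assumptions the load 
takes roughly 1538 s (25.6 min) at 65 kTPS, 1053 s (17.5 min) at 95 
kTPS, 1000 s (16.7 min) at 100 kTPS and 870 s (14.5 min) at 115 kTPS, 
against about 317 s (5.3 min) for parsing alone at 315 kTPS.

```java
class LoadTimeEstimate {
    /** Seconds to load `triples` at a sustained rate given in thousands of triples per second. */
    static double seconds(long triples, double kTps) {
        return triples / (kTps * 1000.0);
    }

    public static void main(String[] args) {
        long triples = 100_000_000L;  // assumption: "100 million BSBM" = 100 million triples
        String[] labels = {
            "TDB2, 3 indexes", "Lizard, 3 indexes", "Lizard, 2 indexes",
            "Lizard, no indexes", "raw parser"
        };
        double[] rates = {65, 95, 100, 115, 315};  // kTPS figures quoted in this message
        for (int i = 0; i < rates.length; i++) {
            double s = seconds(triples, rates[i]);
            System.out.printf("%-20s %6.0f s (%.1f min)%n", labels[i], s, s / 60.0);
        }
    }
}
```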

The empty database is just for uniformity.  TDB2 does not have a 
separate bulk loader: (1) it has not been ported from TDB1 and (2) I am 
seeing whether one is needed.

Hardware: Quad core i7, 32G RAM, SSD.  The BSBM data is streamed from 
rotational disk; the raw parser speed is 315 kTPS.


Re: TDB2

Posted by "ajs6f@virginia.edu" <aj...@virginia.edu>.

Is there some "high level" overview of Lizard/Mantis/TDB2 yet extant? Like the kind of thing we might see at a conference?

In any event, thanks for working on this-- it's great to know that Jena will be able to cluster soon.

---
A. Soroka
The University of Virginia Library


Re: TDB2

Posted by Andy Seaborne <an...@apache.org>.

On 08/06/15 17:48, Marco Neumann wrote:
> is TDB2 going to replace TDB or is TDB2 a new cluster product?

Whatever people (users, developers) want.  Migrating databases is not as 
easy as upgrading code.  Running oaj.tdb and oaj.tdb2 side by side is an 
option.

(TDB2 is itself 7 maven modules ATM - some can be combined as they are 
small and just "a good idea at the time").

TDB2 is not the cluster (that's Lizard).  Mantis started as the 
separation out of the low-level code needed for Lizard.  Initially a 
validation of the reworking of transactions and data structures, a 
little extra work has made it viable as "TDB2".

	Andy

(oaj = org.apache.jena)


Re: TDB2

Posted by Marco Neumann <ma...@gmail.com>.

is TDB2 going to replace TDB or is TDB2 a new cluster product?

Marco




-- 


---
Marco Neumann
KONA

Re: TDB2

Posted by Andy Seaborne <an...@apache.org>.

On 09/06/15 08:38, Osma Suominen wrote:
> On 08/06/15 18:57, Andy Seaborne wrote:
>
>> TDB2 is transactional use only.
>> Additional fun with Java8: all the begin/commit foo is hidden.
>
> Sounds very good. Does it reuse disk space (eventually) after triples
> are deleted?
>
> I'm referring to JENA-804 here, for which I'd like to see a solution,
> but I understood it's not very simple.

It would certainly be good to address JENA-804, in whole or in part 
(i.e. mitigate the effects even if it can't be completely removed 
without a disk format change).

It is definitely at the "not simple" (well, "not quick") end of things.

It's finding a block of time that's the block.  And being about 
persistent data, proceeding carefully and testing a lot is essential.  
Which is more time.

	Andy

>
> -Osma
>
>

Re: TDB2

Posted by Osma Suominen <os...@helsinki.fi>.

On 08/06/15 18:57, Andy Seaborne wrote:

> TDB2 is transactional use only.
> Additional fun with Java8: all the begin/commit foo is hidden.

Sounds very good. Does it reuse disk space (eventually) after triples 
are deleted?

I'm referring to JENA-804 here, for which I'd like to see a solution, 
but I understood it's not very simple.

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: TDB2

Posted by Andy Seaborne <an...@apache.org>.

TDB2 is for transactional use only.
Additional fun with Java 8: all the begin/commit foo is hidden.

    Dataset ds = TDBFactory.createDataset() ;

Here is a write transaction to load a file:

   TDBTxn.executeWrite(ds, ()->RDFDataMgr.read(ds, "http:...")) ;

Or to get the size of the default model safely:

   long size =
     TDBTxn.executeReadReturn(ds, ()->ds.getDefaultModel().size()) ;

	Andy
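
For readers wondering how the begin/commit calls can be hidden behind a 
lambda, here is a minimal sketch of the general pattern.  It is not the 
TDBTxn source: the Transactional interface, the Recorder class and the 
method bodies are assumptions for illustration, with method names chosen 
only to mirror the calls shown above.

```java
import java.util.function.Supplier;

class TxnSketch {
    /** Hypothetical minimal transaction interface; not the Jena or TDB2 API. */
    interface Transactional {
        void begin(boolean write);
        void commit();
        void abort();
        void end();
    }

    /** Run a write action: commit if it returns normally, abort if it throws. */
    static void executeWrite(Transactional tx, Runnable action) {
        tx.begin(true);
        try {
            action.run();
            tx.commit();
        } catch (RuntimeException e) {
            tx.abort();
            throw e;
        } finally {
            tx.end();
        }
    }

    /** Run a read action inside a read transaction and return its result. */
    static <T> T executeReadReturn(Transactional tx, Supplier<T> action) {
        tx.begin(false);
        try {
            return action.get();
        } finally {
            tx.end();
        }
    }

    /** Toy implementation that only records the calls made on it. */
    static final class Recorder implements Transactional {
        final StringBuilder log = new StringBuilder();
        public void begin(boolean write) { log.append(write ? "beginW " : "beginR "); }
        public void commit() { log.append("commit "); }
        public void abort()  { log.append("abort "); }
        public void end()    { log.append("end "); }
    }

    public static void main(String[] args) {
        Recorder tx = new Recorder();
        executeWrite(tx, () -> System.out.println("loading..."));
        long size = executeReadReturn(tx, () -> 42L);
        System.out.println(size + " " + tx.log);
    }
}
```

The caller never writes begin, commit or end, so a transaction cannot be 
left open by a forgotten call or an early return.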