You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@jena.apache.org by Nandika Jain <na...@iiitd.ac.in> on 2022/04/12 08:59:38 UTC

Query regarding architecture of TDB store

  Hello
While reading the documentation for the architecture of TDB store, we came
across the usage of B-Trees in node tables. As far as we understand, the
architecture uses B-Trees over the MD5 hashes of a node to map Node to
NodeId.
We wanted to ask, why is a B-Tree used in place of a HashMap here. This
kind of problem of mapping a key to a value is generally easily solvable
using hash-maps.
It would be very helpful if you could throw some light upon why you chose
to use B-Trees over hash-maps for this kind of mapping.
One of the reasons we think you might have used B-trees is since it allows
you to have a 128-bit MD5 hash which is very secure, because the
probability of collision is now 2^-128. Instead if you used hash-maps, the
output of the hash-function used internally could not have been 128 bits
large (because of space constraints) and thus increasing the chance of
collisions.
We would be grateful if you could also comment upon this above conjecture.
Thank you in advance


-- 
Nandika Jain
CSE Junior Undergrad | IIIT-Delhi

Re: Query regarding architecture of TDB store

Posted by "rvesse@dotnetrdf.org" <rv...@dotnetrdf.org>.
I would start by noting that:


  1.  Our node IDs are not hash bashed.  They are 64 bit identifiers which either represent an inlined value (https://jena.apache.org/documentation/tdb/architecture.html#inline-values) or an offset to the actual RDF term in the node table file.
  2.  We need to do lookups in both directions, from RDF Term to Node ID, and Node ID back to RDF Term

Inlining values means that not all lookups (in either direction) need to touch the node tables at all, any RDF Term that can be inlined can have its Node ID calculated directly, or its RDF Term recreated from the inlined value.

Hash maps are fine when things fit in-memory but one of the key design considerations of TDB is that it copes with very large datasets that cannot be held directly in memory.  B-Tree’s are particularly amenable to serialization onto disk in a block layout.  This means that we can efficiently access them on disk, or cache portions of them in-memory via memory mapped files.  This allows TDB to scale up to massive datasets that we could not hold in memory because we only need to store portions of it in memory at any time.  This is mostly managed for us by the OS through usage of memory mapped files.

Note that we don’t use cryptographic hashes for our Node IDs.  It’s either inlined (as already discussed) or it’s an offset into the actual node table file such that we can efficiently seek to the position for the RDF Term and read its value back out.  Collisions aren’t an issue because for inlined values it’s a deterministic mapping i.e. if a RDF Term can be inlined it always produces the same Node ID, and for any other term it’s the next free offset in the node table which always increases in value since we don’t do reference counting or recycling of Node IDs.

In practical terms there are hash maps used as in-memory caches for both RDF Term to Node ID, and Node ID to RDF Term, so that we can avoid unnecessary disk access for frequently and recently used RDF terms/Node IDs.

Rob

From: Nandika Jain <na...@iiitd.ac.in>
Date: Tuesday, 12 April 2022 at 10:00
To: users@jena.apache.org <us...@jena.apache.org>, Jishnu Raj Parashar <ji...@iiitd.ac.in>, Gaurav Dawra <ga...@iiitd.ac.in>, dev@jena.apache.org <de...@jena.apache.org>
Subject: Query regarding architecture of TDB store
  Hello
While reading the documentation for the architecture of TDB store, we came
across the usage of B-Trees in node tables. As far as we understand, the
architecture uses B-Trees over the MD5 hashes of a node to map Node to
NodeId.
We wanted to ask, why is a B-Tree used in place of a HashMap here. This
kind of problem of mapping a key to a value is generally easily solvable
using hash-maps.
It would be very helpful if you could throw some light upon why you chose
to use B-Trees over hash-maps for this kind of mapping.
One of the reasons we think you might have used B-trees is since it allows
you to have a 128-bit MD5 hash which is very secure, because the
probability of collision is now 2^-128. Instead if you used hash-maps, the
output of the hash-function used internally could not have been 128 bits
large (because of space constraints) and thus increasing the chance of
collisions.
We would be grateful if you could also comment upon this above conjecture.
Thank you in advance


--
Nandika Jain
CSE Junior Undergrad | IIIT-Delhi