Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2013/03/30 19:51:59 UTC
RIOT, blank nodes and JENA-352

Heads-up for a change to blank nodes produced by RIOT when parsing (but 
not when using RDF/XML).

tl;dr

Parsing data with vast numbers of blank nodes in a single file now 
scales better.

Appearance of N-Triples and N-Quads output changes slightly.

Fully compatible with existing data.

https://issues.apache.org/jira/browse/JENA-352

Details:

The blank node allocator has to ensure that two uses of the same label 
always generate the same blank node, even if the label is used in the 
first line of a file and again in the last line.

To do this, RIOT was keeping a map of label to allocated node.  At 
scale, this fails because the map's memory use grows with the number of 
distinct labels (although you do need a lot of blank node labels for it 
to become serious).

The new policy is to use a seed value per parser run, which is 
combined with any string label to produce a globally unique id.  There 
is also an LRU cache of 1000 slots to do map-like sharing and avoid 
excessive calls to the MD5 digest engine.  Typically, a blank node label 
is used in a short section of the file much of the time (think blank 
nodes as subjects or blank nodes in structure values and lists).
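
As a rough illustration of the "LRU cache in front of the hash" idea, 
here is a minimal sketch of a fixed-size LRU cache built on 
java.util.LinkedHashMap.  The class name LabelCache is hypothetical and 
this is not RIOT's own cache code; it only shows the general technique.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; not the cache class used inside RIOT.
class LabelCache<V> extends LinkedHashMap<String, V> {
    private final int maxSlots;

    LabelCache(int maxSlots) {
        // accessOrder = true: entries are ordered least-recently-used first.
        super(16, 0.75f, true);
        this.maxSlots = maxSlots;
    }

    // Called by LinkedHashMap after each insertion; returning true
    // evicts the least-recently-used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxSlots;
    }
}
```

A cache like this with 1000 slots stays at a fixed size however many 
labels the file contains.  A label that has been evicted is simply 
hashed again, so correctness does not depend on the cache.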

The seed is a random UUID (122 bits of randomness).  The label is 
combined with the seed by converting to UTF-8 bytes and using MD5 to give 
a 128-bit hashed value which is assumed to be globally unique.  Using MD5 
makes it fixed length, which is convenient.  As we are not requiring an 
unattackable policy, MD5 is acceptable.
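
A minimal sketch of the seed-plus-label hashing, using only the JDK.  
The class name HashedLabels and the exact way the seed and label bytes 
are fed to the digest are assumptions made for illustration; the real 
RIOT code may combine them differently.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical illustration only; not the RIOT implementation.
class HashedLabels {
    private final byte[] seedBytes;

    // One seed per parser run, e.g. UUID.randomUUID().
    HashedLabels(UUID seed) {
        this.seedBytes = seed.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Same seed + same label always gives the same 32 hex digit id,
    // without storing anything per label.
    String idFor(String label) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(seedBytes);   // assumption: seed first, then label
            md5.update(label.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(32);
            for (byte b : md5.digest()) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }
}
```

Because the id is a pure function of seed and label, the first and last 
lines of a file agree on the blank node for a label with no map at all, 
while a fresh seed for each parser run keeps the same label in two 
different files apart.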

This change is observable - the format of blank nodes printed in 
N-Triples and N-Quads changes slightly.  N-Triples and N-Quads print 
bNodes using the internal label (so they work at arbitrary scale, and 
can even be used to restore blank nodes as described below).

The old allocator used a java.rmi.server.UID, which had : and - 
characters in it.  These were encoded as Xhh for two hex digits (X3A and 
X2D).
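
To make the Xhh encoding concrete, here is a small sketch.  The class 
name LabelEncoding is hypothetical, and the rule "escape anything that 
is not an ASCII letter or digit" is an assumption; the actual rule in 
the writer may differ (for example in how a literal 'X' is treated).

```java
// Hypothetical illustration only; not the encoder used by the writers.
class LabelEncoding {
    // Prefix with "B" and escape non-alphanumeric characters as Xhh.
    static String encode(String internalLabel) {
        StringBuilder sb = new StringBuilder("B");
        for (int i = 0; i < internalLabel.length(); i++) {
            char ch = internalLabel.charAt(i);
            boolean plain = (ch >= 'a' && ch <= 'z')
                         || (ch >= 'A' && ch <= 'Z')
                         || (ch >= '0' && ch <= '9');
            if (plain) {
                sb.append(ch);
            } else {
                // Two uppercase hex digits, e.g. ':' -> X3A, '-' -> X2D
                sb.append(String.format("X%02X", (int) ch));
            }
        }
        return sb.toString();
    }
}
```

Under that assumption, the internal label -5bbaf4a1:13dbc7e7182:-7fff 
encodes to BX2D5bbaf4a1X3A13dbc7e7182X3AX2D7fff, which matches the 
old-style label shown in this message.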

The new format is slightly shorter, and does not have Xhh encoded 
characters.

Old:
_:BX2D5bbaf4a1X3A13dbc7e7182X3AX2D7fff

New:
_:B70db88eb40afc13d2ab37d161e36392e

Printed labels start with "B", a letter, to keep them compatible with 
pre-RDF 1.1 parsers.  Blank node labels can begin with a digit in RDF 
1.1.  _:1 is a legal bnode label in RDF 1.1.

This change does not invalidate any existing data (nothing should depend 
on the format of blank nodes, only uniqueness of ids).  Specifically, all 
existing persistently stored data is still valid.  It'll print old style.

Parsing speed should not be affected.

Restoring blank nodes:

Blank nodes in NT and NQ dumps can be restored by rewriting the NT/NQ 
blank nodes _:Blabel as <_:label>, a pseudo URI scheme that tells RIOT 
(and SPARQL) to use the given label.  Use with care.  Remember to 
remove the 'B'.
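
A naive sketch of that rewrite for a single line of N-Triples or 
N-Quads.  The class name RestoreBNodes is hypothetical.  It assumes 
labels contain only letters and digits (as in the new format) and it 
does not try to skip text inside literals, so treat it as a starting 
point rather than a safe tool.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of RIOT.
class RestoreBNodes {
    // _:B<label> at the start of the line or after whitespace.
    private static final Pattern BNODE =
        Pattern.compile("(^|\\s)_:B([A-Za-z0-9]+)");

    // Rewrites _:Blabel to <_:label>, dropping the leading 'B'.
    static String rewriteLine(String line) {
        return BNODE.matcher(line).replaceAll("$1<_:$2>");
    }
}
```

For example, the line

_:B70db88eb40afc13d2ab37d161e36392e <http://example/p> _:B0a1b .

becomes

<_:70db88eb40afc13d2ab37d161e36392e> <http://example/p> <_:0a1b> .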

To restore old behaviour:

If you think anything odd has changed, you can check by restoring the 
old behaviour.  In class 'SyntaxLabels' replace, in the second static 
function:

LabelToNode.createScopeByDocumentHash()

with

LabelToNode.createScopeByDocument()

	Andy