Posted to dev@jena.apache.org by Andy Seaborne <an...@apache.org> on 2013/03/30 19:51:59 UTC
RIOT, blank nodes and JENA-352
Heads-up for a change to blank nodes produced by RIOT when parsing (but
not when using RDF/XML).
tl;dr
Parsing data with vast numbers of blank nodes in a single file now
scales better.
Appearance of N-Triples and N-Quads output changes slightly.
Fully compatible with existing data.
https://issues.apache.org/jira/browse/JENA-352
Details:
The blank node allocator has to ensure that two uses of the same label
always generate the same blank node, even if the label is used in the
first line of a file and again in the last line.
To do this, RIOT was keeping a map of label to allocated node. At
scale, this fails because the map grows with the number of distinct
labels and uses up memory (although you do need a lot of blank node
labels for it to become serious).
A new policy is to use an allocator with a seed value per parser run,
which is combined with any string label to produce a globally unique id.
There is also an LRU cache of 1000 slots to do map-like sharing and avoid
excessive calls to the MD5 digest engine. Typically, a blank node label
is used in a short section of the file much of the time (think blank
nodes as subjects, or blank nodes in structured values and lists).
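As an illustration of the caching idea only, here is a minimal sketch of a fixed-size LRU map built on java.util.LinkedHashMap. The class name LabelCache is hypothetical and this is not RIOT's actual cache class; it just shows how a bounded map can share recently seen labels without growing with the file.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration, not RIOT's cache implementation.
class LabelCache<V> extends LinkedHashMap<String, V> {
    private final int maxSlots;

    LabelCache(int maxSlots) {
        // accessOrder = true: iteration order is least recently used first.
        super(16, 0.75f, true);
        this.maxSlots = maxSlots;
    }

    // Called by LinkedHashMap after each put; returning true evicts the
    // least recently used entry.
    @Override
    protected boolean removeEldestEntry(Map.Entry<String, V> eldest) {
        return size() > maxSlots;
    }

    public static void main(String[] args) {
        LabelCache<String> cache = new LabelCache<>(2);
        cache.put("a", "node-a");
        cache.put("b", "node-b");
        cache.get("a");            // "a" is now the most recently used
        cache.put("c", "node-c");  // evicts "b"
        System.out.println(cache.keySet()); // prints [a, c]
    }
}
```

A label that falls out of the cache is not a correctness problem under the hashing policy: it is simply hashed again and yields the same id.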
The seed is a random UUID (122 bits of randomness). The label is
combined with the seed by converting to UTF-8 bytes and using MD5 to give
a 128-bit hashed value which is assumed to be globally unique. Using MD5
makes it fixed length, which is convenient. As we are not requiring an
unattackable policy, MD5 is acceptable.
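The seed-plus-label scheme can be sketched roughly as follows. This is an illustration under assumptions, not RIOT's code: the class name HashedLabels is hypothetical, and the way the seed is turned into bytes (here, the UTF-8 bytes of the UUID string) and the order in which seed and label are fed to the digest are guesses.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.UUID;

// Hypothetical illustration of "seed + label -> MD5 -> fixed-length id".
class HashedLabels {
    private final byte[] seedBytes;

    HashedLabels(UUID seed) {
        // Assumption: the seed is serialised as its string form in UTF-8.
        this.seedBytes = seed.toString().getBytes(StandardCharsets.UTF_8);
    }

    // Same seed and same label always give the same 32 hex digit id.
    String idFor(String label) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            md5.update(seedBytes);
            md5.update(label.getBytes(StandardCharsets.UTF_8));
            byte[] digest = md5.digest(); // 16 bytes = 128 bits
            StringBuilder sb = new StringBuilder(32);
            for (byte b : digest) {
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is required to be present on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        HashedLabels run = new HashedLabels(UUID.randomUUID()); // one seed per parser run
        System.out.println("_:B" + run.idFor("b0"));
        System.out.println("_:B" + run.idFor("b0")); // same id as the line above
    }
}
```

Because the id is a pure function of seed and label, no per-label state has to be kept for the whole file; a second parser run gets a fresh seed and so produces different ids for the same labels.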
This change is observable - the format of blank nodes printed in
N-Triples and N-Quads changes slightly. N-Triples and N-Quads print
bNodes using the internal label (so work at arbitrary scale, and can
even be used to restore blank nodes as described below).
The old allocator used a java.rmi.server.UID, which had : and - characters
in it. These were encoded as Xhh for two hex digits (: as X3A and - as X2D).
The new format is slightly shorter, and does not have Xhh encoded
characters.
Old:
_:BX2D5bbaf4a1X3A13dbc7e7182X3AX2D7fff
New:
_:B70db88eb40afc13d2ab37d161e36392e
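To make the Xhh encoding of the old style concrete, here is a small sketch that reproduces the "Old" example above from its unencoded form. The class and method names are hypothetical and this is not the RIOT writer code. Escaping a literal X as well is an assumption made here so the encoding stays reversible, and the sketch only considers ASCII input.

```java
// Hypothetical illustration of the old-style label encoding.
class LabelEncoding {
    // Prefix with "B", pass letters and digits through, and write any
    // other character as X followed by two upper-case hex digits.
    static String encode(String internalLabel) {
        StringBuilder sb = new StringBuilder("B");
        for (int i = 0; i < internalLabel.length(); i++) {
            char ch = internalLabel.charAt(i);
            boolean alnum = (ch >= 'a' && ch <= 'z')
                         || (ch >= 'A' && ch <= 'Z')
                         || (ch >= '0' && ch <= '9');
            if (alnum && ch != 'X') {
                sb.append(ch);
            } else {
                sb.append(String.format("X%02X", (int) ch));
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints _:BX2D5bbaf4a1X3A13dbc7e7182X3AX2D7fff
        System.out.println("_:" + encode("-5bbaf4a1:13dbc7e7182:-7fff"));
    }
}
```

A new-style label is already 32 hex digits, so nothing in it needs escaping and it is printed as "B" followed by the digits.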
Printed labels start with "B", a letter, to keep them compatible with pre-RDF
1.1 parsers. Blank node labels can begin with a digit in RDF 1.1. _:1
is a legal bnode label in RDF 1.1.
This change does not invalidate any existing data (nothing should depend
on the format of blank nodes, only uniqueness of ids). Specifically, all
existing persistently stored data is still valid. It'll print old style.
Parsing speed should not be affected.
Restoring blank nodes:
Blank nodes in NT and NQ dumps can be restored by rewriting the NT/NQ
blank nodes _:Blabel as <_:label>, a pseudo URI scheme that tells RIOT
(and SPARQL) to use the given label. Use with care. Remember to
remove the 'B'.
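The rewrite can be done with any text tool; as a rough sketch in Java, a single regular expression per line is enough for new-style labels. The class name is hypothetical, and the sketch is deliberately naive: it does not skip string literals or IRIs that happen to contain "_:B", and it does not decode Xhh escapes in old-style labels, so check the output before loading it.

```java
import java.util.regex.Pattern;

// Hypothetical helper: rewrite _:Blabel to <_:label> in one NT/NQ line.
class RestoreBNodes {
    // "_:B" followed by the label; the leading B is dropped in the output.
    private static final Pattern BNODE = Pattern.compile("_:B([0-9A-Za-z]+)");

    static String rewriteLine(String line) {
        return BNODE.matcher(line).replaceAll("<_:$1>");
    }

    public static void main(String[] args) {
        String in = "_:B70db88eb40afc13d2ab37d161e36392e <http://example/p> _:Babc .";
        // prints <_:70db88eb40afc13d2ab37d161e36392e> <http://example/p> <_:abc> .
        System.out.println(rewriteLine(in));
    }
}
```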
To restore old behaviour:
If you think anything odd has changed, you can check by restoring the
old behaviour. In class 'SyntaxLabels', in the second static function,
replace:
LabelToNode.createScopeByDocumentHash()
with
LabelToNode.createScopeByDocument()
Andy