Posted to dev@jena.apache.org by Paul Houle <on...@gmail.com> on 2013/06/27 00:32:06 UTC

Fast & Reliable Node Parser

     I've had many requests to port some of the advances in my
Infovore framework to Jena, and now I'm getting around to it.

     My program Infovore, on GitHub at

https://github.com/paulhoule/infovore

     has a module called "parallel super eyeball" which, like the
eyeball program, checks an RDF file for trouble but does not crash
when it finds it. One simplifying trick was to accept only N-Triples
and close variants, such as the Freebase export files. This means I
can reliably break a triple into its three nodes by splitting on the
first two runs of whitespace, then parse each node separately, as
sketched below.
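
     To make that concrete, here is a minimal sketch of the split
step; the method name is mine and the whitespace handling is
illustrative, not the exact code from the module:

    // Split one N-Triples line into its three node lexicals by
    // breaking on the first two runs of whitespace. Assumes no
    // leading whitespace; the object lexical can contain spaces
    // (inside a quoted literal), so it must be kept whole.
    static String[] splitTriple(String line) {
        String[] parts = line.split("\\s+", 3);
        String object = parts[2];
        if (object.endsWith("."))   // drop the terminating " ."
            object = object.substring(0, object.length() - 1).trim();
        return new String[] { parts[0], parts[1], object };
    }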

     I hacked away at the triple parser in Jena to produce something
that parses a single node, and I did it in a surgical way, so there
is a pretty good chance it is correct. The result is here:

https://github.com/paulhoule/infovore/tree/master/millipede/src/main/java/com/ontology2/rdf/parser

     The real trouble with it is that it is terribly slow, so slow
that I was about to give up on it before I introduced a parse cache,
built by the function createNodeParseCache() in

https://github.com/paulhoule/infovore/blob/master/millipede/src/main/java/com/ontology2/rdf/JenaUtil.java
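
     For the list's benefit, here is a rough sketch of the caching
idea using Guava's CacheBuilder; the names and the size bound are
illustrative, and parseNode() stands in for the slow generated
parser (the actual implementation is in JenaUtil.java above):

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import com.hp.hpl.jena.graph.Node;

    // Cache parsed Nodes keyed by lexical form, so a lexical that
    // occurs millions of times is parsed only once.
    LoadingCache<String, Node> nodeCache = CacheBuilder.newBuilder()
        .maximumSize(100000)                // illustrative bound
        .build(new CacheLoader<String, Node>() {
            @Override
            public Node load(String lexical) throws Exception {
                return parseNode(lexical);  // hypothetical parser entry point
            }
        });

    // usage: Node n = nodeCache.get(lexical);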

     This sped it up enough that I was no longer motivated to work
out how to speed it up further, but that work should happen. I'm sure
the parser is doing a lot of set-up work, some of which is
superfluous, and I'm also certain that a handwritten parser could
beat the generated parser. Seeing how many billions of triples there
are out there, a handwritten node parser may be worth the effort; a
sketch of the obvious starting point follows.
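
     The natural shape for such a parser is a dispatch on the first
character of the lexical. This fragment is only a sketch of that
idea; the helper methods are hypothetical, and the real work (and
the validation) would live inside them:

    // Dispatch on the first character of an N-Triples node lexical.
    // Each branch hands off to a dedicated, allocation-light scanner.
    static Node parseNodeFast(String lexical) {
        switch (lexical.charAt(0)) {
            case '<': return parseUriRef(lexical);    // <http://...>
            case '"': return parseLiteral(lexical);   // "..."^^<...> or "..."@en
            case '_': return parseBlankNode(lexical); // _:b0
            default:  throw new IllegalArgumentException(
                          "not an N-Triples node: " + lexical);
        }
    }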

----

    On another note, I couldn't help but notice that it's easy to
fill up memory with identical Node objects, as seen in the following
test:

https://github.com/paulhoule/infovore/blob/master/millipede/src/test/java/com/ontology2/rdf/UnderstandNodeMemoryBehavior.java

    Given that many graphs repeat the same node values a lot, I wrote
some Economizer classes, tested in that same file, that keep a cache
of recently created Node and Triple objects. Perhaps I was being a
bit silly to expect to sort very large arrays of triples in memory,
but I found I was able to greatly reduce the memory required by using
"Economization"; the gist of the idea is sketched below.

Any thoughts?