Posted to dev@jena.apache.org by Paul Houle <on...@gmail.com> on 2013/06/26 23:53:51 UTC

Fast and reliable node parsing

     I've had many requests to port some of the advances in my
Infovore framework to Jena, and now I'm getting around to that.

     My program Infovore at github

https://github.com/paulhoule/infovore

     has a module called "parallel super eyeball" which, like the
Eyeball program, checks an RDF file for trouble, but does not crash
when it finds it.  One simplifying trick was to accept only N-Triples
and close variants, such as the Freebase export files.  This means I
can reliably break a triple into nodes by splitting on the first two
runs of whitespace, then parse the nodes separately.
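
     Roughly, the splitting step looks like this (a sketch only;
splitTriple and the helper names are illustrative, not Infovore's):

    // Split one N-Triples line into subject, predicate and object
    // strings.  N-Triples forbids unescaped whitespace inside the
    // subject and predicate terms, so the first two runs of
    // whitespace are guaranteed to be term separators.
    static String[] splitTriple(String line) {
        String body = line.trim();
        if (body.endsWith("."))                  // drop the terminator
            body = body.substring(0, body.length() - 1).trim();
        int i = firstWhitespace(body, 0);        // end of subject
        int j = firstWhitespace(body, skipWhitespace(body, i));
        return new String[] {
            body.substring(0, i),                        // subject
            body.substring(skipWhitespace(body, i), j),  // predicate
            body.substring(skipWhitespace(body, j))      // object
        };
    }

    static int firstWhitespace(String s, int from) {
        while (from < s.length() && !Character.isWhitespace(s.charAt(from)))
            from++;
        return from;
    }

    static int skipWhitespace(String s, int from) {
        while (from < s.length() && Character.isWhitespace(s.charAt(from)))
            from++;
        return from;
    }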

     I hacked away at the triple parser in Jena to produce something
that parses a single node, and I did it in a surgical way, so there
is a pretty good chance it is correct.  The result is here:

https://github.com/paulhoule/infovore/tree/master/millipede/src/main/java/com/ontology2/rdf/parser

     The real trouble with it is that it is terribly slow,  so slow
that I was about to give up on it before introducing a parse cache,
which is the function createNodeParseCache() in

https://github.com/paulhoule/infovore/blob/master/millipede/src/main/java/com/ontology2/rdf/JenaUtil.java
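
     The cache is essentially memoization of string -> Node parses.
In outline it is something like this (a sketch using Guava's
LoadingCache; parseNodeSlowly() is a stand-in for the underlying
parser call, and the cache size is arbitrary):

    import com.google.common.cache.CacheBuilder;
    import com.google.common.cache.CacheLoader;
    import com.google.common.cache.LoadingCache;
    import com.hp.hpl.jena.graph.Node;

    // Memoize string -> Node parses so that each distinct lexical
    // form is parsed only once.
    static LoadingCache<String, Node> createNodeParseCache() {
        return CacheBuilder.newBuilder()
            .maximumSize(100000)
            .build(new CacheLoader<String, Node>() {
                @Override
                public Node load(String lexicalForm) throws Exception {
                    // stand-in for the real (slow) single-node parse
                    return parseNodeSlowly(lexicalForm);
                }
            });
    }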

     This sped it up to the point where I was no longer motivated to
work out how to make it faster, but that work should happen.  I'm
sure the parser is doing a lot of set-up work, some of it
superfluous, and I'm certain a handwritten parser would be faster
than the generated one.  Seeing how many billions of triples there
are out there, a handwritten node parser may be worth the effort.

----

    On another note, I couldn't help but notice that it's easy to
fill up memory with duplicate (equal but distinct) Node objects, as
seen in the following test:

https://github.com/paulhoule/infovore/blob/master/millipede/src/test/java/com/ontology2/rdf/UnderstandNodeMemoryBehavior.java
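
    The effect is easy to demonstrate: two Nodes built from the same
lexical form are equal but not identical, so every occurrence costs
fresh heap (Node.createURI here; newer Jena uses NodeFactory):

    import com.hp.hpl.jena.graph.Node;

    public class NodeIdentityDemo {
        public static void main(String[] args) {
            Node a = Node.createURI("http://example.org/thing");
            Node b = Node.createURI("http://example.org/thing");
            System.out.println(a.equals(b));  // true  - same value
            System.out.println(a == b);       // false - two objects
        }
    }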

    Given that many graphs repeat the same node values a lot, I wrote
some Economizer classes, tested in the same file, that cache recently
created Node and Triple objects.  Perhaps I was being a bit silly to
expect to sort very large arrays of triples in memory, but I found I
could greatly reduce the memory required by using "Economization".

Any thoughts?

Re: Fast and reliable node parsing

Posted by Andy Seaborne <an...@apache.org>.
Hi Paul,

On parsing:

Jena has a new IO subsystem, org.apache.jena.riot, that replaces the 
Turtle and N-Triples parsers and writers in jena-core, adding TriG, 
N-Quads and RDF/JSON (and there is a JSON-LD adapter floating around).

http://jena.apache.org/documentation/io/index.html
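
For example, reading an N-Triples file through RIOT is one call
("data.nt" is a placeholder filename; the language is picked from the
file extension):

    import com.hp.hpl.jena.rdf.model.Model;
    import com.hp.hpl.jena.rdf.model.ModelFactory;
    import org.apache.jena.riot.RDFDataMgr;

    public class ReadExample {
        public static void main(String[] args) {
            Model model = ModelFactory.createDefaultModel();
            RDFDataMgr.read(model, "data.nt");   // or "data.nt.gz"
            System.out.println(model.size() + " triples");
        }
    }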

It is currently in jena-arq because it also covers dataset I/O and the 
dataset classes are in ARQ (historical).

The new parsers are faster (sometimes much faster, in the case of 
N-Triples) and are RDF 1.1 compliant.

The parser that you've found is the old JavaCC based Turtle parser. 
It's still in jena-core only to maintain compatibility when using 
jena-core alone (which is discouraged).

JavaCC is slower than a handcrafted parser.  Its tokenizing algorithm 
is not bad, but it is more general and less tuned to the use case.  (It 
also has unpleasant failure modes on bad UTF-8, but that's another story.)

RIOT uses its own hand-written tokenizer, TokenizerText, and deals with 
Java byte/character IO efficiently.  Both of these greatly speed up 
parsing.  Using a hand-coded parser also helps, but the main improvement 
is in careful use of bytes->chars and in the tokenization.
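
Driving the tokenizer directly looks roughly like this (a sketch;
TokenizerFactory.makeTokenizerString is from
org.apache.jena.riot.tokens):

    import org.apache.jena.riot.tokens.Token;
    import org.apache.jena.riot.tokens.Tokenizer;
    import org.apache.jena.riot.tokens.TokenizerFactory;

    public class TokenizerDemo {
        public static void main(String[] args) {
            // Tokenize one N-Triples line and print each token.
            Tokenizer tok = TokenizerFactory.makeTokenizerString(
                "<http://example/s> <http://example/p> \"o\" .");
            while (tok.hasNext()) {
                Token t = tok.next();
                System.out.println(t.getType() + " : " + t);
            }
        }
    }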

The tokenizer does about 1e6 tokens/second on a workstation-class 
machine (and for the archives - in 2011!).  Parsing is 150-200k 
triples/second for N-Triples and about 100k for Turtle - all subject 
to what the data looks like (large literals will slow it down).  RIOT 
reads from gzip files as well - this makes very little difference and 
is comparable to using "gzip -d < FILE | riot ...".

If you can see a way to make it faster still, then do let us know.  I've 
done a fair amount of profiling, so (I'd hope!) further speed-ups will 
come from better algorithms, not hot-spot removal.

On caching:

There used to be a node cache - we found it didn't speed things up 
appreciably and it was removed.  The space issues weren't considered so 
important for general use.  It's tricky to have a general cache that 
helps everyone.

I think that node caching would be good but placed somewhere on the 
input path, not in the NodeFactory, so that general API use isn't 
automatically affected by something that may have zero, positive or 
negative benefit.  For example, ARQ and TDB are forever creating and 
dropping Nodes - the JVM these days is quite good at short-lived objects.

All nodes from parsing get created by the ParserProfile.  A ParserProfile 
has various policies, so the parsing process is parameterized by this 
object.  It could add caching and reuse of Nodes.  One obvious thing is 
to build in certain vocabularies - RDF, RDFS, a few easy literals 
(-1, 0, 1, 2, true, false) - as well as a small LRU cache.  I say small 
because it minimises the impact in cases where caching has little effect.

See ParserProfileBase.create(Node currentGraph, Token token)
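
In outline, such a profile might look like this (a sketch only: the
constructor arguments and the choice of the token's string image as a
cache key are assumptions, not working Jena code):

    // Override the single node-creation point and consult a small
    // LRU map first.
    public class CachingParserProfile extends ParserProfileBase {
        private final Map<String, Node> cache =
            new LinkedHashMap<String, Node>(128, 0.75f, true) {
                @Override
                protected boolean removeEldestEntry(Map.Entry<String, Node> e) {
                    return size() > 100;    // keep it small
                }
            };

        // Assumed constructor signature - check ParserProfileBase.
        public CachingParserProfile(Prologue prologue, ErrorHandler handler) {
            super(prologue, handler);
        }

        @Override
        public Node create(Node currentGraph, Token token) {
            String key = token.toString();   // crude cache key
            Node n = cache.get(key);
            if (n == null) {
                n = super.create(currentGraph, token);
                cache.put(key, n);
            }
            return n;
        }
    }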

I don't think this will speed things up - it happens after the bulk of 
the parser inner-loop critical path is done - but it will save space.

Two other places where a cache might be effective:

- in the graph (storage) implementation.  When a triple is added, it can 
be analysed for reusing Nodes, deciding how much effort to put into caching.

- in the parser output stream - the destination for parsing is a 
StreamRDF, so usually that is an adapter from the parser's stream to 
the graph/dataset.  A sketch of this option follows below.
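
Assuming the StreamRDFWrapper convenience base class, an economizing
stream might look like this (a sketch; the unbounded HashMap is where
a real version would want an LRU bound):

    import java.util.HashMap;
    import java.util.Map;

    import com.hp.hpl.jena.graph.Node;
    import com.hp.hpl.jena.graph.Triple;
    import org.apache.jena.riot.system.StreamRDF;
    import org.apache.jena.riot.system.StreamRDFWrapper;

    // Rewrite each triple so repeated nodes share one object before
    // the triple reaches the graph.
    public class EconomizingStreamRDF extends StreamRDFWrapper {
        private final Map<Node, Node> seen = new HashMap<Node, Node>();

        public EconomizingStreamRDF(StreamRDF destination) {
            super(destination);
        }

        @Override
        public void triple(Triple t) {
            super.triple(new Triple(intern(t.getSubject()),
                                    intern(t.getPredicate()),
                                    intern(t.getObject())));
        }

        private Node intern(Node n) {
            Node cached = seen.get(n);
            if (cached != null) return cached;
            seen.put(n, n);
            return n;
        }
    }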

Thoughts? Comments? Contributions?

	Andy



Re: Parsing a Single Node

Posted by Rob Vesse <rv...@yarcdata.com>.
This is not a reply to your main email topic, hence the changed subject line.

SSE.parseNode() parses a single node from a string, optionally using a
PrefixMapping to expand prefixed names.  It allows Turtle/SPARQL-like
syntax and is pretty fast at what it does.
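
For example:

    import com.hp.hpl.jena.graph.Node;
    import com.hp.hpl.jena.sparql.sse.SSE;

    public class ParseNodeExample {
        public static void main(String[] args) {
            // SSE's default prefix mapping already knows rdf:, rdfs:,
            // xsd: and friends.
            Node uri     = SSE.parseNode("<http://example/s>");
            Node literal = SSE.parseNode("'42'^^xsd:integer");
            System.out.println(uri + " / " + literal);
        }
    }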

Often Jena already does what you need; sometimes you just need to ask
whether the Swiss Army knife already has the desired attachment!

On that note, also see SortedDataBag for sorting very large
collections of triples without being bounded by memory; this is how
ARQ sorts very large query results without OOMing.
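
In outline (a sketch: the SerializationFactory and Comparator are
stand-ins you supply yourself; check the atlas.data javadoc for the
exact factory signatures):

    import java.util.Comparator;
    import java.util.Iterator;

    import org.apache.jena.atlas.data.BagFactory;
    import org.apache.jena.atlas.data.SerializationFactory;
    import org.apache.jena.atlas.data.SortedDataBag;
    import org.apache.jena.atlas.data.ThresholdPolicyCount;

    import com.hp.hpl.jena.graph.Triple;

    // Sort more triples than fit in memory: the bag spills to disk
    // once the count threshold is passed, then hands back a
    // merge-sorted iterator.
    static Iterator<Triple> sortTriples(Iterator<Triple> input,
                                        SerializationFactory<Triple> serde,
                                        Comparator<Triple> order) {
        SortedDataBag<Triple> bag = BagFactory.newSortedBag(
            new ThresholdPolicyCount<Triple>(100000), serde, order);
        while (input.hasNext())
            bag.add(input.next());
        return bag.iterator();
    }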

Rob


