Posted to dev@jena.apache.org by "A. Soroka" <aj...@virginia.edu> on 2015/09/26 23:28:51 UTC

Re: Timing tests for jena-624: even a little better

Sorry for spamming the list a bit today, but before COB I wanted to offer some more figures on this effort. Using a port of Scala’s immutable collections [*] in a new branch [**], the new implementation now sees a little better than half the load performance of the “stock” impl (see the log below my sig). Of course these figures are very rough, but hopefully they demonstrate motion in the right direction. I still intend to try out Clojure’s collections, but I think I’m a lot closer to a realistic level of performance. I hope to demonstrate something about the query performance here soon.

[*] https://github.com/andrewoma/dexx

[**] https://github.com/ajs6f/jena/tree/jena-624-dexx

Anyone who is interested in examining these branches should be aware that they are currently moving targets, with commits landing several times a day.

---
A. Soroka
The University of Virginia Library



Running org.apache.jena.sparql.core.mem.PerfTest
==== Data: /Users/ajs6f/Documents/jena/bsbm-1m.nt.gz ====
    Size: 1,000,312 (2.978s, 335,900 tps)
==== DSG/mix/auto (warm N=3)
==== DSG/mix/txn  (warm N=3)
==== DSG/mem/auto (warm N=3)
==== DSG/mem/txn  (warm N=3)
==== DSG/mix/auto (N=20)
==== DSG/mix/auto (N=20) Time: 97.761s (204,644 tps)
==== DSG/mix/txn  (N=20)
==== DSG/mix/txn  (N=20) Time: 101.668s (196,780 tps)
==== DSG/mem/auto (N=20)
==== DSG/mem/auto (N=20) Time: 211.971s (94,381 tps)
==== DSG/mem/txn  (N=20)
==== DSG/mem/txn  (N=20) Time: 151.359s (132,177 tps)
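
The tps figures in the log above are consistent with throughput = N x size / time (for example, 20 x 1,000,312 / 97.761 s gives about 204,644). The following is a minimal Java sketch, assuming that formula; it is inferred from the printed numbers, not taken from PerfTest itself, and the class and method names are illustrative only. It recomputes the throughput and the ratio of the new ("mem") implementation to the stock ("mix") one.

```java
class Throughput {
    /** Triples per second for `runs` loads of `size` triples taking `seconds` in total. */
    static double tps(int runs, long size, double seconds) {
        return runs * (double) size / seconds;
    }

    public static void main(String[] args) {
        long size = 1_000_312L; // from the "Size" line of the log
        int n = 20;             // N=20 timed runs

        // Times copied from the log above.
        double mixAuto = tps(n, size, 97.761);
        double mixTxn  = tps(n, size, 101.668);
        double memAuto = tps(n, size, 211.971);
        double memTxn  = tps(n, size, 151.359);

        // Ratio of new implementation to stock implementation.
        System.out.printf("mem/auto vs mix/auto: %.2f%n", memAuto / mixAuto);
        System.out.printf("mem/txn  vs mix/txn : %.2f%n", memTxn / mixTxn);
    }
}
```

With the times above, the ratios come out at roughly 0.46 for autocommit loads and 0.67 for transactional loads.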

> On Sep 26, 2015, at 1:31 PM, A. Soroka <aj...@email.virginia.edu> wrote:
> 
> I’ve committed the change to using separate triple and quad indexes (via DatasetGraphTriplesQuads). There appears to be a definite and significant improvement: Andy’s numbers showed the current implementation getting 5 times the load performance of the new implementation, while my numbers (below) show the current impl at maybe 2.5 times the new impl’s performance. Thanks for that advice, Andy!
> 
> I’ll probably take a look next at moving to a more powerful library for persistent structures that might either perform better raw or offer finer control over tree creation as discussed above in this thread.
> 
> On a related note, are there any Jena standard parts for query testing for this kind of situation? I know that BSBM has several sophisticated suites of tests defined, but are any of them considered particularly appropriate, or has anyone out there in dev-land built their own harness for BSBM or something else that I could “borrow”? {grin}
> 
> ---
> A. Soroka
> The University of Virginia Library
> 
> ==== Data: /Users/ajs6f/Documents/jena/bsbm-1m.nt.gz ====
>    Size: 1,000,312 (2.947s, 339,434 tps)
> ==== DSG/mix/auto (warm N=3)
> ==== DSG/mix/txn  (warm N=3)
> ==== DSG/mem/auto (warm N=3)
> ==== DSG/mem/txn  (warm N=3)
> ==== DSG/mix/auto (N=20)
> ==== DSG/mix/auto (N=20) Time: 108.331s (184,676 tps)
> ==== DSG/mix/txn  (N=20)
> ==== DSG/mix/txn  (N=20) Time: 105.424s (189,769 tps)
> ==== DSG/mem/auto (N=20)
> ==== DSG/mem/auto (N=20) Time: 283.680s (70,523 tps)
> ==== DSG/mem/txn  (N=20)
> ==== DSG/mem/txn  (N=20) Time: 224.501s (89,114 tps)
> 
>> On Sep 26, 2015, at 9:21 AM, Andy Seaborne <an...@apache.org> wrote:
>> 
>> On 26/09/15 12:07, A. Soroka wrote:
>>> Ooh! Those numbers are awful.
>> 
>> Early days. The general purpose dataset has no features.   And, of course, a concurrent read is completely blocked - that's a major issue for some usages.
>> 
>> Access performance, having update not block query, in a very reliable implementation is a valuable thing to have. And if it is described as a "complete temporal database", it is all a good thing.  Marketing.
>> 
>> The storage implementation is now a self-contained thing to look at. ... seems there is no shortage of options ... google quickly got me:
>> 
>> http://stackoverflow.com/questions/8575723/whats-a-good-persistent-collections-framework-for-use-in-java
>> 
>> and there are more.  Various data structures I have not heard of before.
>> 
>>> Per your point 2, it does create a new
>>> tree per add/remove. And PCollections’ bulk operations are just loops
>>> over the single-element operations, so trying to accumulate data and
>>> use a single operation will create the same number of trees.
>>> Unfortunately, PCollections does not have something like Clojure’s
>>> transient operations [*], where under carefully-controlled conditions
>>> a normally persistent structure can be mutated in place for celerity
>>> of operation. I have no commitment to PCollections, and I can switch
>>> and see what happens with Clojure and transiency. But I should first
>>> go back over the code with a fine-toothed comb and make sure that
>>> there isn’t a plain old mistake of some kind.
>>> 
>>> As far as the indexes, I’m not quite sure what you mean by
>>> “triples+quads”. Do you mean a single map from graph name to three
>>> triple-covering indexes? Something like Map<Node, TripleIndex>, with
>>> TripleIndex having within it three covering indexes for triples in
>>> the way that current HexIndex has within it six covering indexes for
>>> quads?
>> 
>> That's one way - I meant using the supporting framework in DatasetGraphTriplesQuads so
>> 
>> DatasetGraphQuads => DatasetGraphTriplesQuads
>> 
>> The default graph is handled separately from named graphs.
>> 
>> TDB uses this - there is a triple table (dft: 3 index) and a quads table (dft: 6 index)
>> 
>> 	Andy
>> 
>>> 
>>> ---
>>> A. Soroka
>>> The University of Virginia Library
>>> 
>>> [*] http://clojure.org/transients
>>> 
>>>> On Sep 26, 2015, at 6:42 AM, Andy Seaborne <an...@apache.org>
>>>> wrote:
>>>> 
>>>> Some thoughts:
>>>> 
>>>> 1/ If it were a triples+quads design (TripleTable, QuadTable), not
>>>> just quads, there would be 3 indexes not 6 for triples so 2x
>>>> faster.
>>>> 
>>>> 2/ As autocommit and txn forms are nearly the same, I guess that
>>>> every add(Quad) is causing a new pcollections tree for each index.
>>>> 
>>>> I don't know pcollections, but is it possible to use it so that an
>>>> independent tree is created only at begin(W)? I.e. copy-to-root
>>>> does not happen on stuff already touched after begin(W).
>>>> 
>>>> Andy
>>> 
>> 
>
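
The exchange quoted above notes that a persistent collection builds a new tree for every add or remove. The following is a minimal Java sketch of that path-copying behaviour, using a hypothetical unbalanced binary search tree. It is not the PCollections or dexx implementation, and all names are illustrative. Each add allocates a new root plus copies of the nodes on the path to the insertion point, while untouched subtrees are shared with the previous version.

```java
final class PersistentTree {
    static int allocations = 0; // counts nodes created, for illustration only

    final int key;
    final PersistentTree left, right;

    PersistentTree(int key, PersistentTree left, PersistentTree right) {
        this.key = key;
        this.left = left;
        this.right = right;
        allocations++;
    }

    /** Returns a tree containing key; the tree passed in is never modified. */
    static PersistentTree add(PersistentTree t, int key) {
        if (t == null) return new PersistentTree(key, null, null);
        if (key < t.key) return new PersistentTree(t.key, add(t.left, key), t.right);
        if (key > t.key) return new PersistentTree(t.key, t.left, add(t.right, key));
        return t; // already present: reuse the existing tree
    }

    static boolean contains(PersistentTree t, int key) {
        while (t != null) {
            if (key == t.key) return true;
            t = key < t.key ? t.left : t.right;
        }
        return false;
    }

    public static void main(String[] args) {
        PersistentTree v1 = add(add(add(null, 5), 3), 8);
        PersistentTree v2 = add(v1, 4); // copies the path 5 -> 3, shares the subtree at 8

        System.out.println(v1 != v2);             // a new root per add
        System.out.println(v1.right == v2.right); // untouched subtree is shared
        System.out.println(contains(v1, 4));      // the old version is unchanged
    }
}
```

A transient-style API, as in Clojure, avoids some of this copying by allowing in-place mutation within a controlled scope. That is the kind of saving asked about for begin(W) in the thread.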