You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2020/11/30 15:34:29 UTC

Performance regressions in Jena and TDB2

Hello,

We're in the process of replacing an old server that was still running 
Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2 
from Jena 3.16.0 and a TDB2 store.

While testing the new server, I noticed that the new Fuseki is running a 
particular SPARQL query much slower than the old one. This is a query 
performed by Skosmos to find out all the letters of the alphabet for an 
alphabetical index by looking at all the skos:prefLabel values in a 
specific language. It's expected to be a bit slow (several seconds) 
since it needs to look at all the labels - but on the new server, the 
query is almost an order of magnitude slower, which is causing timeout 
issues.

To investigate this more closely, I decided to drop Fuseki out of the 
equation and just use Jena command line utilities. I wanted to compare 
the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs 
TDB2), and variations of the original SPARQL query. For the data, I used 
the newly published KANTO/finaf data set (an authority file of named 
entities, i.e. persons and organizations) which can be downloaded from 
finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and 
the same number of skos:prefLabel values.

I loaded this into a TDB1 data set using Jena 3.7.0 (because of 
JENA-1575) like this:

apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph 
http://example.org/finaf finaf-skos.ttl

Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:

apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph 
http://example.org/finaf finaf-skos.ttl


This is the original SPARQL query:

	PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

	SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
	FROM <http://example.org/finaf>
	WHERE {
	  ?c a skos:Concept .
	  ?c skos:prefLabel ?label .
	  FILTER(langMatches(lang(?label), 'fi'))
	}

The query should return 68 results. There is no particular order since 
there is no ORDER BY, they are just a set of letters and special 
characters such as numbers and punctuation.

I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0 
and 3.16.0, using the options --time --repeat 2,10 (benchmark; two 
rounds of warming up, ten rounds of benchmarking) and wrote down the 
average query time, rounded to the first decimal point. I'm doing the 
benchmarks on an i5-7200U laptop with a pretty fast SSD.

Jena 3.8.0 / TDB1: 2.1s
Jena 3.16.0 / TDB1: 2.4s
Jena 3.8.0 / TDB2: 11.8s
Jena 3.16.0 / TDB2: 12.0s

The difference between Jena versions is not very significant, but TDB2 
is 5-6 times slower than TDB1. Here is how tdbquery -v explains the 
query on the TDB level:

17:06:07 INFO  exec            :: TDB
   (distinct
     (project (?l)
       (extend ((?l (ucase (str (substr ?label 1 1)))))
         (filter (langMatches (lang ?label) "fi")
           (bgp
             (triple ?c 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://www.w3.org/2004/02/skos/core#Concept>)
             (triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel> 
?label)
           )))))

The explanation is identical for tdb2.tdbquery so I won't repeat it.

I then looked at ways of optimizing the query to make it perform better. 
After trying many variations (for example reordering the clauses and/or 
moving the substring expression to a BIND variable), the only change 
that seemed to have a significant effect was to remove the FROM clause 
and instead insert a GRAPH clause targeting the same graph, like this:

	PREFIX skos: <http://www.w3.org/2004/02/skos/core#>

	SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
	WHERE {
	  GRAPH <http://example.org/finaf> {
	    ?c a skos:Concept .
	    ?c skos:prefLabel ?label .
	    FILTER(langMatches(lang(?label), 'fi'))
	  }
	}

Benchmark results for this GRAPH version of the query:

Jena 3.8.0 / TDB1: 0.9s
Jena 3.16.0 / TDB1: 1.3s
Jena 3.8.0 / TDB2: 1.4s
Jena 3.16.0 / TDB2: 1.9s

The results are much more even this time, though Jena 3.16.0 is about 
40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery 
-v (and tdb2.tdbquery -v) explains the query like this:

17:13:02 INFO  exec            :: TDB
   (distinct
     (project (?l)
       (extend ((?l (ucase (str (substr ?label 1 1)))))
         (filter (langMatches (lang ?label) "fi")
           (quadpattern
             (quad <http://example.org/finaf> ?c 
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
<http://www.w3.org/2004/02/skos/core#Concept>)
             (quad <http://example.org/finaf> ?c 
<http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
           )))))

The difference I see compared to the previous query is the use of "quad" 
instead of "triple". My understanding of operations on the TDB level is 
pretty naive, but it seems to me this is now targeting the correct graph 
directly, instead of indirectly, as in the first case. This is a bit 
surprising to me since the "FROM <http://example.org/finaf>" clause in 
the first query is, to me, saying the same thing as the GRAPH clause: 
just target triples in this particular graph. Is there a missed 
opportunity for some optimization here? Why is FROM (much) worse than GRAPH?

I also wonder why TDB2 is so much slower than TDB1, especially for the 
first version of the query. It should be an improvement, right? Of 
course there are trade-offs in implementing any complex system. But it 
makes me think whether we should stick to TDB1 for the time being, as 
there are no obvious benefits in using TDB2 for our current use.

Likewise, this makes me wonder whether there has been a mild decrease in 
performance between Jena 3.8.0 and 3.16.0 - though I didn't look at 
intermediate versions to pinpoint the exact change (or several) that 
would be causing the slowdown. If there's interest, I can try other 
versions as well.

For now we will probably just change Skosmos to use the GRAPH variant of 
the query, which should fix the immediate problems with timeouts. 
Unfortunately I don't have the skills to work directly on the ARQ 
optimizer or TDB2 code bases. But I'd be happy to test other variations 
and potential fixes to these performance problems.

Cheers,
Osma

[1] https://finto.fi/rest/v1/finaf/data?format=text/turtle


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Osma Suominen <os...@helsinki.fi>.
Replying to myself, as I did some follow-up tests.

Osma Suominen kirjoitti 4.12.2020 klo 18.42:
> Now this turned into a rather interesting exercise in using git bisect. 
> I was able to track down the change that caused the slowdown. It's this 
> merge commit:
> 
> [f93fdbad7aa8d6ddb46693395e3bfb5ea487bf16] JENA-1648: Merge commit 
> 'refs/pull/507/head' of https://github.com/apache/jena
> 
> which refers to this pull request:
> 
> https://github.com/apache/jena/pull/507
> 
> I don't have time for very deep analysis right now but it doesn't 
> surprise me that a substantial change to the query result serialization 
> slows down the queries.
> 
> Things to check: (mostly as a TODO list for myself)
> 
> 1. Does this depend on the query result format? For example, is only the 
> text format (default) slower than before?
> 2. Is there something suspicious in the PR 507 code that would explain 
> why it's so much slower?

This affects at least the CSV format too, so it's not just the text 
output format.

But I figured out that the real change here is simply that the warmup 
performed when using the --repeat parameter with two arguments has 
become less effective starting with Jena 3.10.0. When no warmup is used, 
the performance is the same for the different Jena versions.

And now that Andy implemented JENA-2007 which improves the warmup, I 
think the problem has already been solved.

Case closed.

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!

Andy Seaborne kirjoitti 2.12.2020 klo 13.01:
>>> There is no reason I can see why the special case of exactly one 
>>> "FROM" can't be handled specially. It masks all named graphs but is a 
>>> rewrite from triples, that will be fine.
>> Right. Is this worth opening a JIRA issue?
> If you want. It doesn't make it happen though; taht takes coding.

Understood. I won't open an issue right now but maybe later. But I doubt 
I'll be able to implement the change, at least not without some hints 
and perhaps substantial help.

> I replaced the SELECT line with a COUNT to see where the time is going.

OK.

>> Now there is a clear pattern: starting from Jena 3.10.0, the first 
>> timed run is much slower. So something happened there that makes the 
>> first full query (after the partial warmup) take much longer than it 
>> used to - but apparently not the subsequent ones which I timed yesterday.
>>
>> Any ideas what this could be about? Should this be investigated more?
> > It would be great if you would.

Now this turned into a rather interesting exercise in using git bisect. 
I was able to track down the change that caused the slowdown. It's this 
merge commit:

[f93fdbad7aa8d6ddb46693395e3bfb5ea487bf16] JENA-1648: Merge commit 
'refs/pull/507/head' of https://github.com/apache/jena

which refers to this pull request:

https://github.com/apache/jena/pull/507

I don't have time for very deep analysis right now but it doesn't 
surprise me that a substantial change to the query result serialization 
slows down the queries.

Things to check: (mostly as a TODO list for myself)

1. Does this depend on the query result format? For example, is only the 
text format (default) slower than before?
2. Is there something suspicious in the PR 507 code that would explain 
why it's so much slower?

I have no idea right now how significant this performance regression is. 
Judging from the fact that nobody has complained about it in almost 2 
years, it doesn't seem to be very bad.

> I have a change lined up to fix "--out none" to be "less optimized"; it 
> currently does nothing (very efficiently), it could consume silently the 
> query results.
> 
> It would also be better if the warmup with writing the required format 
> to /dev/null would also be better.

I see that you already did this - great work!

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Andy Seaborne <an...@apache.org>.

On 02/12/2020 10:07, Osma Suominen wrote:
> Hi Andy!
> 
> Andy Seaborne kirjoitti 1.12.2020 klo 23.15:
>> There is no reason I can see why the special case of exactly one 
>> "FROM" can't be handled specially. It masks all named graphs but is a 
>> rewrite from triples, that will be fine.
> 
> Right. Is this worth opening a JIRA issue?

If you want. It doesn't make it happen though; taht takes coding.

> 
>>> TDB1: 0.825s
>>> TDB2: 1.127s
>>>
>>> No real difference to the quad version using GRAPH (see below).
>>
>> For me:
>> About 300-400ms is going in the SELECT expression.
> 
> Do you mean that it would be possible to further optimize the query 
> while getting the same result? I tested moving the SELECT expression 
> inside the query as a BIND, but it didn't seem to make any difference.

I replaced the SELECT line with a COUNT to see where the time is going.

>> The query timing has a bug - it does not preload the result set 
>> formatting during the warmup so classloading (and JIT but that is a 
>> lesser issue) happen during the main execution. Approximately, junk 
>> the first timed run.
>>
>> The rest go up and down because :
>> (1) other stuff on the machine going tick
>> (2) bits of successive JIT happening
> 
> That gave me an idea. I was wondering why there was such a clear 
> difference between Jena 3.8.0 and 3.16.0 in the initial benchmark (where 
> I used average query times after the warmup), but no difference in the 
> second one (where I used minimum query times).
> 
> I decided to look at the time for the first query after the partial 
> warmup - when you said "junk the first timed run" I did exactly the 
> opposite :) So I benchmarked the GRAPH variant of the query, on both 
> TDB1 and TDB2, using all Jena releases from 3.8.0 to 3.16.0. I used 
> --repeat 1,1 and ran all the queries 5 times as separate invocations, 
> noting the minimum time. The results:

Lots of class loading and JIT ...

> 
> TDB1
> 
> 3.8.0: 1.0s
> 3.9.0: 1.0s
> 3.10.0: 3.5s
> 3.11.0: 3.4s
> 3.12.0: 3.6s
> 3.13.0: 3.2s
> 3.14.0: 3.2s
> 3.15.0: 3.2s
> 3.16.0: 3.1s
> 
> TDB2
> 
> 3.8.0: 1.4s
> 3.9.0: 1.4s
> 3.10.0: 5.4s
> 3.11.0: 5.7s
> 3.12.0: 5.6s
> 3.13.0: 5.9s
> 3.14.0: 5.4s
> 3.15.0: 5.6s
> 3.16.0: 5.7s
> 
> Now there is a clear pattern: starting from Jena 3.10.0, the first timed 
> run is much slower. So something happened there that makes the first 
> full query (after the partial warmup) take much longer than it used to - 
> but apparently not the subsequent ones which I timed yesterday.
> 
> Any ideas what this could be about? Should this be investigated more?

It would be great if you would.

I have a change lined up to fix "--out none" to be "less optimized"; it 
currently does nothing (very efficiently), it could consume silently the 
query results.

It would also be better if the warmup with writing the required format 
to /dev/null would also be better.

     Andy

> I 
> looked at the 3.10.0 release notes but I couldn't see anything obviously 
> related. Of course the release happened almost two years ago, so it's 
> been a while...
> 
> -Osma
> 

Re: Performance regressions in Jena and TDB2

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!

Andy Seaborne kirjoitti 1.12.2020 klo 23.15:
> There is no reason I can see why the special case of exactly one "FROM" 
> can't be handled specially. It masks all named graphs but is a rewrite 
> from triples, that will be fine.

Right. Is this worth opening a JIRA issue?

>> TDB1: 0.825s
>> TDB2: 1.127s
>>
>> No real difference to the quad version using GRAPH (see below).
> 
> For me:
> About 300-400ms is going in the SELECT expression.

Do you mean that it would be possible to further optimize the query 
while getting the same result? I tested moving the SELECT expression 
inside the query as a BIND, but it didn't seem to make any difference.

> The query timing has a bug - it does not preload the result set 
> formatting during the warmup so classloading (and JIT but that is a 
> lesser issue) happen during the main execution. Approximately, junk the 
> first timed run.
> 
> The rest go up and down because :
> (1) other stuff on the machine going tick
> (2) bits of successive JIT happening

That gave me an idea. I was wondering why there was such a clear 
difference between Jena 3.8.0 and 3.16.0 in the initial benchmark (where 
I used average query times after the warmup), but no difference in the 
second one (where I used minimum query times).

I decided to look at the time for the first query after the partial 
warmup - when you said "junk the first timed run" I did exactly the 
opposite :) So I benchmarked the GRAPH variant of the query, on both 
TDB1 and TDB2, using all Jena releases from 3.8.0 to 3.16.0. I used 
--repeat 1,1 and ran all the queries 5 times as separate invocations, 
noting the minimum time. The results:

TDB1

3.8.0: 1.0s
3.9.0: 1.0s
3.10.0: 3.5s
3.11.0: 3.4s
3.12.0: 3.6s
3.13.0: 3.2s
3.14.0: 3.2s
3.15.0: 3.2s
3.16.0: 3.1s

TDB2

3.8.0: 1.4s
3.9.0: 1.4s
3.10.0: 5.4s
3.11.0: 5.7s
3.12.0: 5.6s
3.13.0: 5.9s
3.14.0: 5.4s
3.15.0: 5.6s
3.16.0: 5.7s

Now there is a clear pattern: starting from Jena 3.10.0, the first timed 
run is much slower. So something happened there that makes the first 
full query (after the partial warmup) take much longer than it used to - 
but apparently not the subsequent ones which I timed yesterday.

Any ideas what this could be about? Should this be investigated more? I 
looked at the 3.10.0 release notes but I couldn't see anything obviously 
related. Of course the release happened almost two years ago, so it's 
been a while...

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Andy Seaborne <an...@apache.org>.

On 01/12/2020 14:50, Osma Suominen wrote:
> Hi Andy!
> 
> Thanks again for your insight on this.
> 
> Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
>> This is misused of FROM.
>>
>> Using FROM is not as simple as just selection one graph from the 
>> dataset. It is creating a complete new dataset comprised of graph 
>> picked out of a pool of graph in the TDB database.
>>
>> The app should use GRAPH if that is what it means!
> 
> It's a bit harsh to call this "misuse" of FROM, but I see your point. 
> I've been using SPARQL for over 10 years and I still get tripped up by 
> things that work (or not) in surprising ways...
> 
> Anyway I created a Skosmos PR [1] that switches this query to use GRAPH, 
> so the performance issue goes away. We may have to look at other queries 
> as well to see if they can be optimized by avoiding FROM - especially 
> now that we use TDB2 which seems to be more sensitive about this.

There is no reason I can see why the special case of exactly one "FROM" 
can't be handled specially. It masks all named graphs but is a rewrite 
from triples, that will be fine.

>> FROM is more.
>> FROM can be several graphs
>>
>> SELECT *
>> FROM <g1>
>> FROM <g2>
>> WHERE ..
>>
>> meaning that the default graph is the union view of g1 and g2.  And 
>> also that there are no named graphs (no use of FROM NAMED) so using 
>> GRAPH in the query will see nothing.
> 
>  From a naive user perspective, here are a few reasons to prefer FROM 
> over GRAPH:
> 
> 1. The syntax is more convenient and minimal. To target a specific 
> graph, add a FROM clause; to target the default graph, just remove it 
> (or comment out). GRAPH is more cumbersome since it adds another nesting 
> layer.
> 
> 2. It's more versatile, as you say (though in this case the query didn't 
> really make use of the additional flexibility)
> 
> 3. It looks more familiar if you know SQL.
> 
>> As to why there is a big difference - I don't know but the right thing 
>> is not to use FROM unless you want the union and hiding effects. It's 
>> about features not performance (the low level optimizer is being 
>> partially blocked - no quads).
> 
> I see, that's good to know - but IMHO not obvious just reading the 
> SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].
> 
> Just out of curiosity, I loaded the same data set into an RDF4J native 
> store (default settings) and tested both the FROM and GRAPH variants of 
> the query. There was no measurable difference in execution time: both 
> queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with 
> GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J 
> doesn't have such a big difference between FROM and GRAPH (but this 
> observation is based only on a single data point!).

RDF4J has a different graph storage. In a way, it always "FROM" like. I 
don't know if it is the same full FROM/FROM NAMED feature.

The spec is carefully worded by design to allow the different systems, 
RDF4J included to operate in a compliant manner. Systems already had 
different behaviours.

> 
>> TDB2 is not about being faster than TDB1 - it's about better scale of 
>> transactions, more robust and better handling of inline datatypes.
>> The TDB2 bulk loader is faster ... with enough hardware much faster 
>> (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster 
>> for multibillions datasets.
>>
>> Query-wise they are comparable.
>>
>> There are two possible effects:
>>
>> If you did a straight bulk load of the two datasets then TDB1 is going 
>> to be a little faster because the dataset has a slightly better layout 
>> initially. However, if any updates happen, the difference will fade - 
>> it's also a reflection of the data base size (470Mi) and OS caching.
> 
> Understood. I also agree about TDB2 having a better design and 
> especially the improved transaction support is valuable. I don't think 
> we want to go back to TDB1.
> 
>> It would be interesting to see the query figures if you loaded as 
>> triples - the triples table is smaller than the quads table with one 
>> graph.
> 
> Tested this briefly using Jena 3.16.0 only. I loaded the file into the 
> default graph of both a TDB1 and a TDB2 store, and timed the query which 
> of course had neither FROM nor GRAPH. I noticed that there was quite a 
> bit of variance in the timings I did yesterday so I went for another 
> measurement approach this time - I used 10 repetitions (without warmup) 
> and wrote down the minimum time from the 10 query executions.
> 
> TDB1: 0.825s
> TDB2: 1.127s
> 
> No real difference to the quad version using GRAPH (see below).

For me:
About 300-400ms is going in the SELECT expression.

> 
>> Another effect is that DB2 has a better internal design with more 
>> abstraction and classes. It may take more time for the JIT to fully 
>> optimize.
> 
> When using --repeat 10, the first query takes much longer but after that 
> there is no obvious pattern for the next queries. Some are faster than 
> others, but the second one is already as fast as any of the others.

The query timing has a bug - it does not preload the result set 
formatting during the warmup so classloading (and JIT but that is a 
lesser issue) happen during the main execution. Approximately, junk the 
first timed run.

The rest go up and down because :
(1) other stuff on the machine going tick
(2) bits of successive JIT happening

> 
>>> Likewise, this makes me wonder whether there has been a mild decrease 
>>> in performance between Jena 3.8.0 and 3.16.0 - though I didn't look 
>>> at intermediate versions to pinpoint the exact change (or several) 
>>> that would be causing the slowdown. If there's interest, I can try 
>>> other versions as well. 
>>
>> Yes please. That would be useful to know.
> 
> OK, I did some additional measurements. I downloaded all releases from 
> 3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on 
> both TDB1 and TDB2 stores. Again this is the minimum time for 10 
> repetitions.
> 
> Sidenote: It's quite difficult to do accurate benchmarking on a laptop. 

Yes, isn't it :-)

There are other application both competign for RAM and also waking up in 
the background (Chrome! Thunderbird!), and the OS doing file system 
cache maintenance.

> I tried to eliminate interference from other applications, but there are 
> still aspects that are hard to control such as CPU frequency, which may 
> vary e.g. by temperature.
> 
> Here are the results:
> 
> TDB1
> 
> 3.8.0: 0.904s
> 3.9.0: 0.901s
> 3.10.0: 0.874s
> 3.11.0: 0.906s
> 3.12.0: 0.871s
> 3.13.0: 0.903s
> 3.14.0: 0.917s
> 3.15.0: 0.916s
> 3.16.0: 0.892s
> 
> 
> TDB2
> 
> 3.8.0: 1.300s
> 3.9.0: 1.169s
> 3.10.0: 1.319s
> 3.11.0: 1.147s
> 3.12.0: 1.264s
> 3.13.0: 1.108s
> 3.14.0: 1.158s
> 3.15.0: 1.323s
> 3.16.0: 1.285s
> 
> There is no clear difference between versions this time (though TDB2 is 
> a bit slower overall than TDB1). It looks like the 40% decrease in 
> performance between 3.8.0 and 3.16.0 that I reported yesterday was a 
> measurement error; though it's conceivable that the average query time 
> has increased even if the minimum times have not. Nevertheless I don't 
> think this requires further study.
> 
>> PS 1991-27-59 is not a date :-)
> 
> I know :) This is what happens when you convert 200k MARC records from a 
> legacy system into RDF - errors in the data that nobody noticed before 
> will suddenly crop up. There is also a bad URI value starting with a 
> colon, though Jena doesn't complain loudly about it (RDF4J chokes 
> though). These have already been fixed in the original records, so they 
> will be corrected soon in the RDF data set as well when we regenerate it 
> in the next few days.
> 
> -Osma
> 
> [1] https://github.com/NatLibFi/Skosmos/pull/1098
> 
> [2] https://jena.apache.org/tutorials/sparql_datasets.html
> 
> 
> 

Re: Performance regressions in Jena and TDB2

Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!

Thanks again for your insight on this.

Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
> This is misused of FROM.
> 
> Using FROM is not as simple as just selection one graph from the 
> dataset. It is creating a complete new dataset comprised of graph picked 
> out of a pool of graph in the TDB database.
> 
> The app should use GRAPH if that is what it means!

It's a bit harsh to call this "misuse" of FROM, but I see your point. 
I've been using SPARQL for over 10 years and I still get tripped up by 
things that work (or not) in surprising ways...

Anyway I created a Skosmos PR [1] that switches this query to use GRAPH, 
so the performance issue goes away. We may have to look at other queries 
as well to see if they can be optimized by avoiding FROM - especially 
now that we use TDB2 which seems to be more sensitive about this.

> FROM is more.
> FROM can be several graphs
> 
> SELECT *
> FROM <g1>
> FROM <g2>
> WHERE ..
> 
> meaning that the default graph is the union view of g1 and g2.  And also 
> that there are no named graphs (no use of FROM NAMED) so using GRAPH in 
> the query will see nothing.

 From a naive user perspective, here are a few reasons to prefer FROM 
over GRAPH:

1. The syntax is more convenient and minimal. To target a specific 
graph, add a FROM clause; to target the default graph, just remove it 
(or comment out). GRAPH is more cumbersome since it adds another nesting 
layer.

2. It's more versatile, as you say (though in this case the query didn't 
really make use of the additional flexibility)

3. It looks more familiar if you know SQL.

> As to why there is a big difference - I don't know but the right thing 
> is not to use FROM unless you want the union and hiding effects. It's 
> about features not performance (the low level optimizer is being 
> partially blocked - no quads).

I see, that's good to know - but IMHO not obvious just reading the 
SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].

Just out of curiosity, I loaded the same data set into an RDF4J native 
store (default settings) and tested both the FROM and GRAPH variants of 
the query. There was no measurable difference in execution time: both 
queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with 
GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J 
doesn't have such a big difference between FROM and GRAPH (but this 
observation is based only on a single data point!).

> TDB2 is not about being faster than TDB1 - it's about better scale of 
> transactions, more robust and better handling of inline datatypes.
> The TDB2 bulk loader is faster ... with enough hardware much faster 
> (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster for 
> multibillions datasets.
> 
> Query-wise they are comparable.
> 
> There are two possible effects:
> 
> If you did a straight bulk load of the two datasets then TDB1 is going 
> to be a little faster because the dataset has a slightly better layout 
> initially. However, if any updates happen, the difference will fade - 
> it's also a reflection of the data base size (470Mi) and OS caching.

Understood. I also agree about TDB2 having a better design and 
especially the improved transaction support is valuable. I don't think 
we want to go back to TDB1.

> It would be interesting to see the query figures if you loaded as 
> triples - the triples table is smaller than the quads table with one graph.

Tested this briefly using Jena 3.16.0 only. I loaded the file into the 
default graph of both a TDB1 and a TDB2 store, and timed the query which 
of course had neither FROM nor GRAPH. I noticed that there was quite a 
bit of variance in the timings I did yesterday so I went for another 
measurement approach this time - I used 10 repetitions (without warmup) 
and wrote down the minimum time from the 10 query executions.

TDB1: 0.825s
TDB2: 1.127s

No real difference to the quad version using GRAPH (see below).

> Another effect is that DB2 has a better internal design with more 
> abstraction and classes. It may take more time for the JIT to fully 
> optimize.

When using --repeat 10, the first query takes much longer but after that 
there is no obvious pattern for the next queries. Some are faster than 
others, but the second one is already as fast as any of the others.

>> Likewise, this makes me wonder whether there has been a mild decrease 
>> in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at 
>> intermediate versions to pinpoint the exact change (or several) that 
>> would be causing the slowdown. If there's interest, I can try other 
>> versions as well. 
> 
> Yes please. That would be useful to know.

OK, I did some additional measurements. I downloaded all releases from 
3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on 
both TDB1 and TDB2 stores. Again this is the minimum time for 10 
repetitions.

Sidenote: It's quite difficult to do accurate benchmarking on a laptop. 
I tried to eliminate interference from other applications, but there are 
still aspects that are hard to control such as CPU frequency, which may 
vary e.g. by temperature.

Here are the results:

TDB1

3.8.0: 0.904s
3.9.0: 0.901s
3.10.0: 0.874s
3.11.0: 0.906s
3.12.0: 0.871s
3.13.0: 0.903s
3.14.0: 0.917s
3.15.0: 0.916s
3.16.0: 0.892s


TDB2

3.8.0: 1.300s
3.9.0: 1.169s
3.10.0: 1.319s
3.11.0: 1.147s
3.12.0: 1.264s
3.13.0: 1.108s
3.14.0: 1.158s
3.15.0: 1.323s
3.16.0: 1.285s

There is no clear difference between versions this time (though TDB2 is 
a bit slower overall than TDB1). It looks like the 40% decrease in 
performance between 3.8.0 and 3.16.0 that I reported yesterday was a 
measurement error; though it's conceivable that the average query time 
has increased even if the minimum times have not. Nevertheless I don't 
think this requires further study.

> PS 1991-27-59 is not a date :-)

I know :) This is what happens when you convert 200k MARC records from a 
legacy system into RDF - errors in the data that nobody noticed before 
will suddenly crop up. There is also a bad URI value starting with a 
colon, though Jena doesn't complain loudly about it (RDF4J chokes 
though). These have already been fixed in the original records, so they 
will be corrected soon in the RDF data set as well when we regenerate it 
in the next few days.

-Osma

[1] https://github.com/NatLibFi/Skosmos/pull/1098

[2] https://jena.apache.org/tutorials/sparql_datasets.html



-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Andy Seaborne <an...@apache.org>.
Hi Osma,

This is misused of FROM.

Using FROM is not as simple as just selection one graph from the 
dataset. It is creating a complete new dataset comprised of graph picked 
out of a pool of graph in the TDB database.

The app should use GRAPH if that is what it means!

FROM is more.
FROM can be several graphs

SELECT *
FROM <g1>
FROM <g2>
WHERE ..

meaning that the default graph is the union view of g1 and g2.  And also 
that there are no named graphs (no use of FROM NAMED) so using GRAPH in 
the query will see nothing.

As to why there is a big difference - I don't know but the right thing 
is not to use FROM unless you want the union and hiding effects. It's 
about features not performance (the low level optimizer is being 
partially blocked - no quads).

TDB2 is not about being faster than TDB1 - it's about better scale of 
transactions, more robust and better handling of inline datatypes.
The TDB2 bulk loader is faster ... with enough hardware much faster 
(--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster for 
multibillions datasets.

Query-wise they are comparable.

There are two possible effects:

If you did a straight bulk load of the two datasets then TDB1 is going 
to be a little faster because the dataset has a slightly better layout 
initially. However, if any updates happen, the difference will fade - 
it's also a reflection of the data base size (470Mi) and OS caching.

It would be interesting to see the query figures if you loaded as 
triples - the triples table is smaller than the quads table with one graph.

Another effect is that DB2 has a better internal design with more 
abstraction and classes. It may take more time for the JIT to fully 
optimize.

> Likewise, this makes me wonder whether there has been a mild decrease in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at intermediate versions to pinpoint the exact change (or several) that would be causing the slowdown. If there's interest, I can try other versions as well. 

Yes please. That would be useful to know.

     Andy

PS 1991-27-59 is not a date :-)


On 30/11/2020 15:34, Osma Suominen wrote:
> Hello,
> 
> We're in the process of replacing an old server that was still running 
> Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2 
> from Jena 3.16.0 and a TDB2 store.
> 
> While testing the new server, I noticed that the new Fuseki is running a 
> particular SPARQL query much slower than the old one. This is a query 
> performed by Skosmos to find out all the letters of the alphabet for an 
> alphabetical index by looking at all the skos:prefLabel values in a 
> specific language. It's expected to be a bit slow (several seconds) 
> since it needs to look at all the labels - but on the new server, the 
> query is almost an order of magnitude slower, which is causing timeout 
> issues.
> 
> To investigate this more closely, I decided to drop Fuseki out of the 
> equation and just use Jena command line utilities. I wanted to compare 
> the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs 
> TDB2), and variations of the original SPARQL query. For the data, I used 
> the newly published KANTO/finaf data set (an authority file of named 
> entities, i.e. persons and organizations) which can be downloaded from 
> finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and 
> the same number of skos:prefLabel values.
> 
> I loaded this into a TDB1 data set using Jena 3.7.0 (because of 
> JENA-1575) like this:
> 
> apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph 
> http://example.org/finaf finaf-skos.ttl
> 
> Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:
> 
> apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph 
> http://example.org/finaf finaf-skos.ttl
> 
> 
> This is the original SPARQL query:
> 
>      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> 
>      SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
>      FROM <http://example.org/finaf>
>      WHERE {
>        ?c a skos:Concept .
>        ?c skos:prefLabel ?label .
>        FILTER(langMatches(lang(?label), 'fi'))
>      }
> 
> The query should return 68 results. There is no particular order since 
> there is no ORDER BY, they are just a set of letters and special 
> characters such as numbers and punctuation.
> 
> I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0 
> and 3.16.0, using the options --time --repeat 2,10 (benchmark; two 
> rounds of warming up, ten rounds of benchmarking) and wrote down the 
> average query time, rounded to the first decimal point. I'm doing the 
> benchmarks on an i5-7200U laptop with a pretty fast SSD.
> 
> Jena 3.8.0 / TDB1: 2.1s
> Jena 3.16.0 / TDB1: 2.4s
> Jena 3.8.0 / TDB2: 11.8s
> Jena 3.16.0 / TDB2: 12.0s
> 
> The difference between Jena versions is not very significant, but TDB2 
> is 5-6 times slower than TDB1. Here is how tdbquery -v explains the 
> query on the TDB level:
> 
> 17:06:07 INFO  exec            :: TDB
>    (distinct
>      (project (?l)
>        (extend ((?l (ucase (str (substr ?label 1 1)))))
>          (filter (langMatches (lang ?label) "fi")
>            (bgp
>              (triple ?c 
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
> <http://www.w3.org/2004/02/skos/core#Concept>)
>              (triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel> 
> ?label)
>            )))))
> 
> The explanation is identical for tdb2.tdbquery so I won't repeat it.
> 
> I then looked at ways of optimizing the query to make it perform better. 
> After trying many variations (for example reordering the clauses and/or 
> moving the substring expression to a BIND variable), the only change 
> that seemed to have a significant effect was to remove the FROM clause 
> and instead insert a GRAPH clause targeting the same graph, like this:
> 
>      PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
> 
>      SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
>      WHERE {
>        GRAPH <http://example.org/finaf> {
>          ?c a skos:Concept .
>          ?c skos:prefLabel ?label .
>          FILTER(langMatches(lang(?label), 'fi'))
>        }
>      }
> 
> Benchmark results for this GRAPH version of the query:
> 
> Jena 3.8.0 / TDB1: 0.9s
> Jena 3.16.0 / TDB1: 1.3s
> Jena 3.8.0 / TDB2: 1.4s
> Jena 3.16.0 / TDB2: 1.9s
> 
> The results are much more even this time, though Jena 3.16.0 is about 
> 40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery 
> -v (and tdb2.tdbquery -v) explains the query like this:
> 
> 17:13:02 INFO  exec            :: TDB
>    (distinct
>      (project (?l)
>        (extend ((?l (ucase (str (substr ?label 1 1)))))
>          (filter (langMatches (lang ?label) "fi")
>            (quadpattern
>              (quad <http://example.org/finaf> ?c 
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type> 
> <http://www.w3.org/2004/02/skos/core#Concept>)
>              (quad <http://example.org/finaf> ?c 
> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
>            )))))
> 
> The difference I see compared to the previous query is the use of "quad" 
> instead of "triple". My understanding of operations on the TDB level is 
> pretty naive, but it seems to me this is now targeting the correct graph 
> directly, instead of indirectly, as in the first case. This is a bit 
> surprising to me since the "FROM <http://example.org/finaf>" clause in 
> the first query is, to me, saying the same thing as the GRAPH clause: 
> just target triples in this particular graph. Is there a missed 
> opportunity for some optimization here? Why is FROM (much) worse than 
> GRAPH?
> 
> I also wonder why TDB2 is so much slower than TDB1, especially for the 
> first version of the query. It should be an improvement, right? Of 
> course there are trade-offs in implementing any complex system. But it 
> makes me think whether we should stick to TDB1 for the time being, as 
> there are no obvious benefits in using TDB2 for our current use.
> 
> Likewise, this makes me wonder whether there has been a mild decrease in 
> performance between Jena 3.8.0 and 3.16.0 - though I didn't look at 
> intermediate versions to pinpoint the exact change (or several) that 
> would be causing the slowdown. If there's interest, I can try other 
> versions as well.
> 
> For now we will probably just change Skosmos to use the GRAPH variant of 
> the query, which should fix the immediate problems with timeouts. 
> Unfortunately I don't have the skills to work directly on the ARQ 
> optimizer or TDB2 code bases. But I'd be happy to test other variations 
> and potential fixes to these performance problems.
> 
> Cheers,
> Osma
> 
> [1] https://finto.fi/rest/v1/finaf/data?format=text/turtle
> 
>