You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2020/11/30 15:34:29 UTC
Performance regressions in Jena and TDB2
Hello,
We're in the process of replacing an old server that was still running
Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2
from Jena 3.16.0 and a TDB2 store.
While testing the new server, I noticed that the new Fuseki is running a
particular SPARQL query much slower than the old one. This is a query
performed by Skosmos to find out all the letters of the alphabet for an
alphabetical index by looking at all the skos:prefLabel values in a
specific language. It's expected to be a bit slow (several seconds)
since it needs to look at all the labels - but on the new server, the
query is almost an order of magnitude slower, which is causing timeout
issues.
To investigate this more closely, I decided to drop Fuseki out of the
equation and just use Jena command line utilities. I wanted to compare
the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs
TDB2), and variations of the original SPARQL query. For the data, I used
the newly published KANTO/finaf data set (an authority file of named
entities, i.e. persons and organizations) which can be downloaded from
finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and
the same number of skos:prefLabel values.
I loaded this into a TDB1 data set using Jena 3.7.0 (because of
JENA-1575) like this:
apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph
http://example.org/finaf finaf-skos.ttl
Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:
apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph
http://example.org/finaf finaf-skos.ttl
This is the original SPARQL query:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
FROM <http://example.org/finaf>
WHERE {
?c a skos:Concept .
?c skos:prefLabel ?label .
FILTER(langMatches(lang(?label), 'fi'))
}
The query should return 68 results. There is no particular order since
there is no ORDER BY, they are just a set of letters and special
characters such as numbers and punctuation.
I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0
and 3.16.0, using the options --time --repeat 2,10 (benchmark; two
rounds of warming up, ten rounds of benchmarking) and wrote down the
average query time, rounded to the first decimal point. I'm doing the
benchmarks on an i5-7200U laptop with a pretty fast SSD.
Jena 3.8.0 / TDB1: 2.1s
Jena 3.16.0 / TDB1: 2.4s
Jena 3.8.0 / TDB2: 11.8s
Jena 3.16.0 / TDB2: 12.0s
The difference between Jena versions is not very significant, but TDB2
is 5-6 times slower than TDB1. Here is how tdbquery -v explains the
query on the TDB level:
17:06:07 INFO exec :: TDB
(distinct
(project (?l)
(extend ((?l (ucase (str (substr ?label 1 1)))))
(filter (langMatches (lang ?label) "fi")
(bgp
(triple ?c
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.w3.org/2004/02/skos/core#Concept>)
(triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel>
?label)
)))))
The explanation is identical for tdb2.tdbquery so I won't repeat it.
I then looked at ways of optimizing the query to make it perform better.
After trying many variations (for example reordering the clauses and/or
moving the substring expression to a BIND variable), the only change
that seemed to have a significant effect was to remove the FROM clause
and instead insert a GRAPH clause targeting the same graph, like this:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
WHERE {
GRAPH <http://example.org/finaf> {
?c a skos:Concept .
?c skos:prefLabel ?label .
FILTER(langMatches(lang(?label), 'fi'))
}
}
Benchmark results for this GRAPH version of the query:
Jena 3.8.0 / TDB1: 0.9s
Jena 3.16.0 / TDB1: 1.3s
Jena 3.8.0 / TDB2: 1.4s
Jena 3.16.0 / TDB2: 1.9s
The results are much more even this time, though Jena 3.16.0 is about
40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery
-v (and tdb2.tdbquery -v) explains the query like this:
17:13:02 INFO exec :: TDB
(distinct
(project (?l)
(extend ((?l (ucase (str (substr ?label 1 1)))))
(filter (langMatches (lang ?label) "fi")
(quadpattern
(quad <http://example.org/finaf> ?c
<http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
<http://www.w3.org/2004/02/skos/core#Concept>)
(quad <http://example.org/finaf> ?c
<http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
)))))
The difference I see compared to the previous query is the use of "quad"
instead of "triple". My understanding of operations on the TDB level is
pretty naive, but it seems to me this is now targeting the correct graph
directly, instead of indirectly, as in the first case. This is a bit
surprising to me since the "FROM <http://example.org/finaf>" clause in
the first query is, to me, saying the same thing as the GRAPH clause:
just target triples in this particular graph. Is there a missed
opportunity for some optimization here? Why is FROM (much) worse than GRAPH?
I also wonder why TDB2 is so much slower than TDB1, especially for the
first version of the query. It should be an improvement, right? Of
course there are trade-offs in implementing any complex system. But it
makes me think whether we should stick to TDB1 for the time being, as
there are no obvious benefits in using TDB2 for our current use.
Likewise, this makes me wonder whether there has been a mild decrease in
performance between Jena 3.8.0 and 3.16.0 - though I didn't look at
intermediate versions to pinpoint the exact change (or several) that
would be causing the slowdown. If there's interest, I can try other
versions as well.
For now we will probably just change Skosmos to use the GRAPH variant of
the query, which should fix the immediate problems with timeouts.
Unfortunately I don't have the skills to work directly on the ARQ
optimizer or TDB2 code bases. But I'd be happy to test other variations
and potential fixes to these performance problems.
Cheers,
Osma
[1] https://finto.fi/rest/v1/finaf/data?format=text/turtle
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Osma Suominen <os...@helsinki.fi>.
Replying to myself, as I did some follow-up tests.
Osma Suominen kirjoitti 4.12.2020 klo 18.42:
> Now this turned into a rather interesting exercise in using git bisect.
> I was able to track down the change that caused the slowdown. It's this
> merge commit:
>
> [f93fdbad7aa8d6ddb46693395e3bfb5ea487bf16] JENA-1648: Merge commit
> 'refs/pull/507/head' of https://github.com/apache/jena
>
> which refers to this pull request:
>
> https://github.com/apache/jena/pull/507
>
> I don't have time for very deep analysis right now but it doesn't
> surprise me that a substantial change to the query result serialization
> slows down the queries.
>
> Things to check: (mostly as a TODO list for myself)
>
> 1. Does this depend on the query result format? For example, is only the
> text format (default) slower than before?
> 2. Is there something suspicious in the PR 507 code that would explain
> why it's so much slower?
This affects at least the CSV format too, so it's not just the text
output format.
But I figured out that the real change here is simply that the warmup
performed when using the --repeat parameter with two arguments has
become less effective starting with Jena 3.10.0. When no warmup is used,
the performance is the same for the different Jena versions.
And now that Andy implemented JENA-2007 which improves the warmup, I
think the problem has already been solved.
Case closed.
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!
Andy Seaborne kirjoitti 2.12.2020 klo 13.01:
>>> There is no reason I can see why the special case of exactly one
>>> "FROM" can't be handled specially. It masks all named graphs but is a
>>> rewrite from triples, that will be fine.
>> Right. Is this worth opening a JIRA issue?
> If you want. It doesn't make it happen though; taht takes coding.
Understood. I won't open an issue right now but maybe later. But I doubt
I'll be able to implement the change, at least not without some hints
and perhaps substantial help.
> I replaced the SELECT line with a COUNT to see where the time is going.
OK.
>> Now there is a clear pattern: starting from Jena 3.10.0, the first
>> timed run is much slower. So something happened there that makes the
>> first full query (after the partial warmup) take much longer than it
>> used to - but apparently not the subsequent ones which I timed yesterday.
>>
>> Any ideas what this could be about? Should this be investigated more?
> > It would be great if you would.
Now this turned into a rather interesting exercise in using git bisect.
I was able to track down the change that caused the slowdown. It's this
merge commit:
[f93fdbad7aa8d6ddb46693395e3bfb5ea487bf16] JENA-1648: Merge commit
'refs/pull/507/head' of https://github.com/apache/jena
which refers to this pull request:
https://github.com/apache/jena/pull/507
I don't have time for very deep analysis right now but it doesn't
surprise me that a substantial change to the query result serialization
slows down the queries.
Things to check: (mostly as a TODO list for myself)
1. Does this depend on the query result format? For example, is only the
text format (default) slower than before?
2. Is there something suspicious in the PR 507 code that would explain
why it's so much slower?
I have no idea right now how significant this performance regression is.
Judging from the fact that nobody has complained about it in almost 2
years, it doesn't seem to be very bad.
> I have a change lined up to fix "--out none" to be "less optimized"; it
> currently does nothing (very efficiently), it could consume silently the
> query results.
>
> It would also be better if the warmup with writing the required format
> to /dev/null would also be better.
I see that you already did this - great work!
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Andy Seaborne <an...@apache.org>.
On 02/12/2020 10:07, Osma Suominen wrote:
> Hi Andy!
>
> Andy Seaborne kirjoitti 1.12.2020 klo 23.15:
>> There is no reason I can see why the special case of exactly one
>> "FROM" can't be handled specially. It masks all named graphs but is a
>> rewrite from triples, that will be fine.
>
> Right. Is this worth opening a JIRA issue?
If you want. It doesn't make it happen though; taht takes coding.
>
>>> TDB1: 0.825s
>>> TDB2: 1.127s
>>>
>>> No real difference to the quad version using GRAPH (see below).
>>
>> For me:
>> About 300-400ms is going in the SELECT expression.
>
> Do you mean that it would be possible to further optimize the query
> while getting the same result? I tested moving the SELECT expression
> inside the query as a BIND, but it didn't seem to make any difference.
I replaced the SELECT line with a COUNT to see where the time is going.
>> The query timing has a bug - it does not preload the result set
>> formatting during the warmup so classloading (and JIT but that is a
>> lesser issue) happen during the main execution. Approximately, junk
>> the first timed run.
>>
>> The rest go up and down because :
>> (1) other stuff on the machine going tick
>> (2) bits of successive JIT happening
>
> That gave me an idea. I was wondering why there was such a clear
> difference between Jena 3.8.0 and 3.16.0 in the initial benchmark (where
> I used average query times after the warmup), but no difference in the
> second one (where I used minimum query times).
>
> I decided to look at the time for the first query after the partial
> warmup - when you said "junk the first timed run" I did exactly the
> opposite :) So I benchmarked the GRAPH variant of the query, on both
> TDB1 and TDB2, using all Jena releases from 3.8.0 to 3.16.0. I used
> --repeat 1,1 and ran all the queries 5 times as separate invocations,
> noting the minimum time. The results:
Lots of class loading and JIT ...
>
> TDB1
>
> 3.8.0: 1.0s
> 3.9.0: 1.0s
> 3.10.0: 3.5s
> 3.11.0: 3.4s
> 3.12.0: 3.6s
> 3.13.0: 3.2s
> 3.14.0: 3.2s
> 3.15.0: 3.2s
> 3.16.0: 3.1s
>
> TDB2
>
> 3.8.0: 1.4s
> 3.9.0: 1.4s
> 3.10.0: 5.4s
> 3.11.0: 5.7s
> 3.12.0: 5.6s
> 3.13.0: 5.9s
> 3.14.0: 5.4s
> 3.15.0: 5.6s
> 3.16.0: 5.7s
>
> Now there is a clear pattern: starting from Jena 3.10.0, the first timed
> run is much slower. So something happened there that makes the first
> full query (after the partial warmup) take much longer than it used to -
> but apparently not the subsequent ones which I timed yesterday.
>
> Any ideas what this could be about? Should this be investigated more?
It would be great if you would.
I have a change lined up to fix "--out none" to be "less optimized"; it
currently does nothing (very efficiently), it could consume silently the
query results.
It would also be better if the warmup with writing the required format
to /dev/null would also be better.
Andy
> I
> looked at the 3.10.0 release notes but I couldn't see anything obviously
> related. Of course the release happened almost two years ago, so it's
> been a while...
>
> -Osma
>
Re: Performance regressions in Jena and TDB2
Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!
Andy Seaborne kirjoitti 1.12.2020 klo 23.15:
> There is no reason I can see why the special case of exactly one "FROM"
> can't be handled specially. It masks all named graphs but is a rewrite
> from triples, that will be fine.
Right. Is this worth opening a JIRA issue?
>> TDB1: 0.825s
>> TDB2: 1.127s
>>
>> No real difference to the quad version using GRAPH (see below).
>
> For me:
> About 300-400ms is going in the SELECT expression.
Do you mean that it would be possible to further optimize the query
while getting the same result? I tested moving the SELECT expression
inside the query as a BIND, but it didn't seem to make any difference.
> The query timing has a bug - it does not preload the result set
> formatting during the warmup so classloading (and JIT but that is a
> lesser issue) happen during the main execution. Approximately, junk the
> first timed run.
>
> The rest go up and down because :
> (1) other stuff on the machine going tick
> (2) bits of successive JIT happening
That gave me an idea. I was wondering why there was such a clear
difference between Jena 3.8.0 and 3.16.0 in the initial benchmark (where
I used average query times after the warmup), but no difference in the
second one (where I used minimum query times).
I decided to look at the time for the first query after the partial
warmup - when you said "junk the first timed run" I did exactly the
opposite :) So I benchmarked the GRAPH variant of the query, on both
TDB1 and TDB2, using all Jena releases from 3.8.0 to 3.16.0. I used
--repeat 1,1 and ran all the queries 5 times as separate invocations,
noting the minimum time. The results:
TDB1
3.8.0: 1.0s
3.9.0: 1.0s
3.10.0: 3.5s
3.11.0: 3.4s
3.12.0: 3.6s
3.13.0: 3.2s
3.14.0: 3.2s
3.15.0: 3.2s
3.16.0: 3.1s
TDB2
3.8.0: 1.4s
3.9.0: 1.4s
3.10.0: 5.4s
3.11.0: 5.7s
3.12.0: 5.6s
3.13.0: 5.9s
3.14.0: 5.4s
3.15.0: 5.6s
3.16.0: 5.7s
Now there is a clear pattern: starting from Jena 3.10.0, the first timed
run is much slower. So something happened there that makes the first
full query (after the partial warmup) take much longer than it used to -
but apparently not the subsequent ones which I timed yesterday.
Any ideas what this could be about? Should this be investigated more? I
looked at the 3.10.0 release notes but I couldn't see anything obviously
related. Of course the release happened almost two years ago, so it's
been a while...
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Andy Seaborne <an...@apache.org>.
On 01/12/2020 14:50, Osma Suominen wrote:
> Hi Andy!
>
> Thanks again for your insight on this.
>
> Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
>> This is misused of FROM.
>>
>> Using FROM is not as simple as just selection one graph from the
>> dataset. It is creating a complete new dataset comprised of graph
>> picked out of a pool of graph in the TDB database.
>>
>> The app should use GRAPH if that is what it means!
>
> It's a bit harsh to call this "misuse" of FROM, but I see your point.
> I've been using SPARQL for over 10 years and I still get tripped up by
> things that work (or not) in surprising ways...
>
> Anyway I created a Skosmos PR [1] that switches this query to use GRAPH,
> so the performance issue goes away. We may have to look at other queries
> as well to see if they can be optimized by avoiding FROM - especially
> now that we use TDB2 which seems to be more sensitive about this.
There is no reason I can see why the special case of exactly one "FROM"
can't be handled specially. It masks all named graphs but is a rewrite
from triples, that will be fine.
>> FROM is more.
>> FROM can be several graphs
>>
>> SELECT *
>> FROM <g1>
>> FROM <g2>
>> WHERE ..
>>
>> meaning that the default graph is the union view of g1 and g2. And
>> also that there are no named graphs (no use of FROM NAMED) so using
>> GRAPH in the query will see nothing.
>
> From a naive user perspective, here are a few reasons to prefer FROM
> over GRAPH:
>
> 1. The syntax is more convenient and minimal. To target a specific
> graph, add a FROM clause; to target the default graph, just remove it
> (or comment out). GRAPH is more cumbersome since it adds another nesting
> layer.
>
> 2. It's more versatile, as you say (though in this case the query didn't
> really make use of the additional flexibility)
>
> 3. It looks more familiar if you know SQL.
>
>> As to why there is a big difference - I don't know but the right thing
>> is not to use FROM unless you want the union and hiding effects. It's
>> about features not performance (the low level optimizer is being
>> partially blocked - no quads).
>
> I see, that's good to know - but IMHO not obvious just reading the
> SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].
>
> Just out of curiosity, I loaded the same data set into an RDF4J native
> store (default settings) and tested both the FROM and GRAPH variants of
> the query. There was no measurable difference in execution time: both
> queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with
> GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J
> doesn't have such a big difference between FROM and GRAPH (but this
> observation is based only on a single data point!).
RDF4J has a different graph storage. In a way, it always "FROM" like. I
don't know if it is the same full FROM/FROM NAMED feature.
The spec is carefully worded by design to allow the different systems,
RDF4J included to operate in a compliant manner. Systems already had
different behaviours.
>
>> TDB2 is not about being faster than TDB1 - it's about better scale of
>> transactions, more robust and better handling of inline datatypes.
>> The TDB2 bulk loader is faster ... with enough hardware much faster
>> (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster
>> for multibillions datasets.
>>
>> Query-wise they are comparable.
>>
>> There are two possible effects:
>>
>> If you did a straight bulk load of the two datasets then TDB1 is going
>> to be a little faster because the dataset has a slightly better layout
>> initially. However, if any updates happen, the difference will fade -
>> it's also a reflection of the data base size (470Mi) and OS caching.
>
> Understood. I also agree about TDB2 having a better design and
> especially the improved transaction support is valuable. I don't think
> we want to go back to TDB1.
>
>> It would be interesting to see the query figures if you loaded as
>> triples - the triples table is smaller than the quads table with one
>> graph.
>
> Tested this briefly using Jena 3.16.0 only. I loaded the file into the
> default graph of both a TDB1 and a TDB2 store, and timed the query which
> of course had neither FROM nor GRAPH. I noticed that there was quite a
> bit of variance in the timings I did yesterday so I went for another
> measurement approach this time - I used 10 repetitions (without warmup)
> and wrote down the minimum time from the 10 query executions.
>
> TDB1: 0.825s
> TDB2: 1.127s
>
> No real difference to the quad version using GRAPH (see below).
For me:
About 300-400ms is going in the SELECT expression.
>
>> Another effect is that DB2 has a better internal design with more
>> abstraction and classes. It may take more time for the JIT to fully
>> optimize.
>
> When using --repeat 10, the first query takes much longer but after that
> there is no obvious pattern for the next queries. Some are faster than
> others, but the second one is already as fast as any of the others.
The query timing has a bug - it does not preload the result set
formatting during the warmup so classloading (and JIT but that is a
lesser issue) happen during the main execution. Approximately, junk the
first timed run.
The rest go up and down because :
(1) other stuff on the machine going tick
(2) bits of successive JIT happening
>
>>> Likewise, this makes me wonder whether there has been a mild decrease
>>> in performance between Jena 3.8.0 and 3.16.0 - though I didn't look
>>> at intermediate versions to pinpoint the exact change (or several)
>>> that would be causing the slowdown. If there's interest, I can try
>>> other versions as well.
>>
>> Yes please. That would be useful to know.
>
> OK, I did some additional measurements. I downloaded all releases from
> 3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on
> both TDB1 and TDB2 stores. Again this is the minimum time for 10
> repetitions.
>
> Sidenote: It's quite difficult to do accurate benchmarking on a laptop.
Yes, isn't it :-)
There are other application both competign for RAM and also waking up in
the background (Chrome! Thunderbird!), and the OS doing file system
cache maintenance.
> I tried to eliminate interference from other applications, but there are
> still aspects that are hard to control such as CPU frequency, which may
> vary e.g. by temperature.
>
> Here are the results:
>
> TDB1
>
> 3.8.0: 0.904s
> 3.9.0: 0.901s
> 3.10.0: 0.874s
> 3.11.0: 0.906s
> 3.12.0: 0.871s
> 3.13.0: 0.903s
> 3.14.0: 0.917s
> 3.15.0: 0.916s
> 3.16.0: 0.892s
>
>
> TDB2
>
> 3.8.0: 1.300s
> 3.9.0: 1.169s
> 3.10.0: 1.319s
> 3.11.0: 1.147s
> 3.12.0: 1.264s
> 3.13.0: 1.108s
> 3.14.0: 1.158s
> 3.15.0: 1.323s
> 3.16.0: 1.285s
>
> There is no clear difference between versions this time (though TDB2 is
> a bit slower overall than TDB1). It looks like the 40% decrease in
> performance between 3.8.0 and 3.16.0 that I reported yesterday was a
> measurement error; though it's conceivable that the average query time
> has increased even if the minimum times have not. Nevertheless I don't
> think this requires further study.
>
>> PS 1991-27-59 is not a date :-)
>
> I know :) This is what happens when you convert 200k MARC records from a
> legacy system into RDF - errors in the data that nobody noticed before
> will suddenly crop up. There is also a bad URI value starting with a
> colon, though Jena doesn't complain loudly about it (RDF4J chokes
> though). These have already been fixed in the original records, so they
> will be corrected soon in the RDF data set as well when we regenerate it
> in the next few days.
>
> -Osma
>
> [1] https://github.com/NatLibFi/Skosmos/pull/1098
>
> [2] https://jena.apache.org/tutorials/sparql_datasets.html
>
>
>
Re: Performance regressions in Jena and TDB2
Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!
Thanks again for your insight on this.
Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
> This is misused of FROM.
>
> Using FROM is not as simple as just selection one graph from the
> dataset. It is creating a complete new dataset comprised of graph picked
> out of a pool of graph in the TDB database.
>
> The app should use GRAPH if that is what it means!
It's a bit harsh to call this "misuse" of FROM, but I see your point.
I've been using SPARQL for over 10 years and I still get tripped up by
things that work (or not) in surprising ways...
Anyway I created a Skosmos PR [1] that switches this query to use GRAPH,
so the performance issue goes away. We may have to look at other queries
as well to see if they can be optimized by avoiding FROM - especially
now that we use TDB2 which seems to be more sensitive about this.
> FROM is more.
> FROM can be several graphs
>
> SELECT *
> FROM <g1>
> FROM <g2>
> WHERE ..
>
> meaning that the default graph is the union view of g1 and g2. And also
> that there are no named graphs (no use of FROM NAMED) so using GRAPH in
> the query will see nothing.
From a naive user perspective, here are a few reasons to prefer FROM
over GRAPH:
1. The syntax is more convenient and minimal. To target a specific
graph, add a FROM clause; to target the default graph, just remove it
(or comment out). GRAPH is more cumbersome since it adds another nesting
layer.
2. It's more versatile, as you say (though in this case the query didn't
really make use of the additional flexibility)
3. It looks more familiar if you know SQL.
> As to why there is a big difference - I don't know but the right thing
> is not to use FROM unless you want the union and hiding effects. It's
> about features not performance (the low level optimizer is being
> partially blocked - no quads).
I see, that's good to know - but IMHO not obvious just reading the
SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].
Just out of curiosity, I loaded the same data set into an RDF4J native
store (default settings) and tested both the FROM and GRAPH variants of
the query. There was no measurable difference in execution time: both
queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with
GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J
doesn't have such a big difference between FROM and GRAPH (but this
observation is based only on a single data point!).
> TDB2 is not about being faster than TDB1 - it's about better scale of
> transactions, more robust and better handling of inline datatypes.
> The TDB2 bulk loader is faster ... with enough hardware much faster
> (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster for
> multibillions datasets.
>
> Query-wise they are comparable.
>
> There are two possible effects:
>
> If you did a straight bulk load of the two datasets then TDB1 is going
> to be a little faster because the dataset has a slightly better layout
> initially. However, if any updates happen, the difference will fade -
> it's also a reflection of the data base size (470Mi) and OS caching.
Understood. I also agree about TDB2 having a better design and
especially the improved transaction support is valuable. I don't think
we want to go back to TDB1.
> It would be interesting to see the query figures if you loaded as
> triples - the triples table is smaller than the quads table with one graph.
Tested this briefly using Jena 3.16.0 only. I loaded the file into the
default graph of both a TDB1 and a TDB2 store, and timed the query which
of course had neither FROM nor GRAPH. I noticed that there was quite a
bit of variance in the timings I did yesterday so I went for another
measurement approach this time - I used 10 repetitions (without warmup)
and wrote down the minimum time from the 10 query executions.
TDB1: 0.825s
TDB2: 1.127s
No real difference to the quad version using GRAPH (see below).
> Another effect is that DB2 has a better internal design with more
> abstraction and classes. It may take more time for the JIT to fully
> optimize.
When using --repeat 10, the first query takes much longer but after that
there is no obvious pattern for the next queries. Some are faster than
others, but the second one is already as fast as any of the others.
>> Likewise, this makes me wonder whether there has been a mild decrease
>> in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at
>> intermediate versions to pinpoint the exact change (or several) that
>> would be causing the slowdown. If there's interest, I can try other
>> versions as well.
>
> Yes please. That would be useful to know.
OK, I did some additional measurements. I downloaded all releases from
3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on
both TDB1 and TDB2 stores. Again this is the minimum time for 10
repetitions.
Sidenote: It's quite difficult to do accurate benchmarking on a laptop.
I tried to eliminate interference from other applications, but there are
still aspects that are hard to control such as CPU frequency, which may
vary e.g. by temperature.
Here are the results:
TDB1
3.8.0: 0.904s
3.9.0: 0.901s
3.10.0: 0.874s
3.11.0: 0.906s
3.12.0: 0.871s
3.13.0: 0.903s
3.14.0: 0.917s
3.15.0: 0.916s
3.16.0: 0.892s
TDB2
3.8.0: 1.300s
3.9.0: 1.169s
3.10.0: 1.319s
3.11.0: 1.147s
3.12.0: 1.264s
3.13.0: 1.108s
3.14.0: 1.158s
3.15.0: 1.323s
3.16.0: 1.285s
There is no clear difference between versions this time (though TDB2 is
a bit slower overall than TDB1). It looks like the 40% decrease in
performance between 3.8.0 and 3.16.0 that I reported yesterday was a
measurement error; though it's conceivable that the average query time
has increased even if the minimum times have not. Nevertheless I don't
think this requires further study.
> PS 1991-27-59 is not a date :-)
I know :) This is what happens when you convert 200k MARC records from a
legacy system into RDF - errors in the data that nobody noticed before
will suddenly crop up. There is also a bad URI value starting with a
colon, though Jena doesn't complain loudly about it (RDF4J chokes
though). These have already been fixed in the original records, so they
will be corrected soon in the RDF data set as well when we regenerate it
in the next few days.
-Osma
[1] https://github.com/NatLibFi/Skosmos/pull/1098
[2] https://jena.apache.org/tutorials/sparql_datasets.html
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Andy Seaborne <an...@apache.org>.
Hi Osma,
This is misused of FROM.
Using FROM is not as simple as just selection one graph from the
dataset. It is creating a complete new dataset comprised of graph picked
out of a pool of graph in the TDB database.
The app should use GRAPH if that is what it means!
FROM is more.
FROM can be several graphs
SELECT *
FROM <g1>
FROM <g2>
WHERE ..
meaning that the default graph is the union view of g1 and g2. And also
that there are no named graphs (no use of FROM NAMED) so using GRAPH in
the query will see nothing.
As to why there is a big difference - I don't know but the right thing
is not to use FROM unless you want the union and hiding effects. It's
about features not performance (the low level optimizer is being
partially blocked - no quads).
TDB2 is not about being faster than TDB1 - it's about better scale of
transactions, more robust and better handling of inline datatypes.
The TDB2 bulk loader is faster ... with enough hardware much faster
(--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster for
multibillions datasets.
Query-wise they are comparable.
There are two possible effects:
If you did a straight bulk load of the two datasets then TDB1 is going
to be a little faster because the dataset has a slightly better layout
initially. However, if any updates happen, the difference will fade -
it's also a reflection of the data base size (470Mi) and OS caching.
It would be interesting to see the query figures if you loaded as
triples - the triples table is smaller than the quads table with one graph.
Another effect is that DB2 has a better internal design with more
abstraction and classes. It may take more time for the JIT to fully
optimize.
> Likewise, this makes me wonder whether there has been a mild decrease in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at intermediate versions to pinpoint the exact change (or several) that would be causing the slowdown. If there's interest, I can try other versions as well.
Yes please. That would be useful to know.
Andy
PS 1991-27-59 is not a date :-)
On 30/11/2020 15:34, Osma Suominen wrote:
> Hello,
>
> We're in the process of replacing an old server that was still running
> Fuseki1 from Jena 3.8.0 with a TDB1 store. The new server has Fuseki2
> from Jena 3.16.0 and a TDB2 store.
>
> While testing the new server, I noticed that the new Fuseki is running a
> particular SPARQL query much slower than the old one. This is a query
> performed by Skosmos to find out all the letters of the alphabet for an
> alphabetical index by looking at all the skos:prefLabel values in a
> specific language. It's expected to be a bit slow (several seconds)
> since it needs to look at all the labels - but on the new server, the
> query is almost an order of magnitude slower, which is causing timeout
> issues.
>
> To investigate this more closely, I decided to drop Fuseki out of the
> equation and just use Jena command line utilities. I wanted to compare
> the effect of Jena versions (3.8.0 vs 3.16.0), store type (TDB1 vs
> TDB2), and variations of the original SPARQL query. For the data, I used
> the newly published KANTO/finaf data set (an authority file of named
> entities, i.e. persons and organizations) which can be downloaded from
> finto.fi [1]. It has around 3M triples, 200k skos:Concept instances and
> the same number of skos:prefLabel values.
>
> I loaded this into a TDB1 data set using Jena 3.7.0 (because of
> JENA-1575) like this:
>
> apache-jena-3.7.0/bin/tdbloader --loc tdb1 --graph
> http://example.org/finaf finaf-skos.ttl
>
> Likewise, I loaded the same data set into a TDB2 store using Jena 3.16.0:
>
> apache-jena-3.16.0/bin/tdb2.tdbloader --loc tdb2 --graph
> http://example.org/finaf finaf-skos.ttl
>
>
> This is the original SPARQL query:
>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>
> SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
> FROM <http://example.org/finaf>
> WHERE {
> ?c a skos:Concept .
> ?c skos:prefLabel ?label .
> FILTER(langMatches(lang(?label), 'fi'))
> }
>
> The query should return 68 results. There is no particular order since
> there is no ORDER BY, they are just a set of letters and special
> characters such as numbers and punctuation.
>
> I ran it using tdbquery / tdb2.tdbquery, separately on both Jena 3.8.0
> and 3.16.0, using the options --time --repeat 2,10 (benchmark; two
> rounds of warming up, ten rounds of benchmarking) and wrote down the
> average query time, rounded to the first decimal point. I'm doing the
> benchmarks on an i5-7200U laptop with a pretty fast SSD.
>
> Jena 3.8.0 / TDB1: 2.1s
> Jena 3.16.0 / TDB1: 2.4s
> Jena 3.8.0 / TDB2: 11.8s
> Jena 3.16.0 / TDB2: 12.0s
>
> The difference between Jena versions is not very significant, but TDB2
> is 5-6 times slower than TDB1. Here is how tdbquery -v explains the
> query on the TDB level:
>
> 17:06:07 INFO exec :: TDB
> (distinct
> (project (?l)
> (extend ((?l (ucase (str (substr ?label 1 1)))))
> (filter (langMatches (lang ?label) "fi")
> (bgp
> (triple ?c
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://www.w3.org/2004/02/skos/core#Concept>)
> (triple ?c <http://www.w3.org/2004/02/skos/core#prefLabel>
> ?label)
> )))))
>
> The explanation is identical for tdb2.tdbquery so I won't repeat it.
>
> I then looked at ways of optimizing the query to make it perform better.
> After trying many variations (for example reordering the clauses and/or
> moving the substring expression to a BIND variable), the only change
> that seemed to have a significant effect was to remove the FROM clause
> and instead insert a GRAPH clause targeting the same graph, like this:
>
> PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
>
> SELECT DISTINCT (ucase(str(substr(?label, 1, 1))) as ?l)
> WHERE {
> GRAPH <http://example.org/finaf> {
> ?c a skos:Concept .
> ?c skos:prefLabel ?label .
> FILTER(langMatches(lang(?label), 'fi'))
> }
> }
>
> Benchmark results for this GRAPH version of the query:
>
> Jena 3.8.0 / TDB1: 0.9s
> Jena 3.16.0 / TDB1: 1.3s
> Jena 3.8.0 / TDB2: 1.4s
> Jena 3.16.0 / TDB2: 1.9s
>
> The results are much more even this time, though Jena 3.16.0 is about
> 40% slower than 3.8.0 and TDB2 is about 50% slower than TDB1. tdbquery
> -v (and tdb2.tdbquery -v) explains the query like this:
>
> 17:13:02 INFO exec :: TDB
> (distinct
> (project (?l)
> (extend ((?l (ucase (str (substr ?label 1 1)))))
> (filter (langMatches (lang ?label) "fi")
> (quadpattern
> (quad <http://example.org/finaf> ?c
> <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>
> <http://www.w3.org/2004/02/skos/core#Concept>)
> (quad <http://example.org/finaf> ?c
> <http://www.w3.org/2004/02/skos/core#prefLabel> ?label)
> )))))
>
> The difference I see compared to the previous query is the use of "quad"
> instead of "triple". My understanding of operations on the TDB level is
> pretty naive, but it seems to me this is now targeting the correct graph
> directly, instead of indirectly, as in the first case. This is a bit
> surprising to me since the "FROM <http://example.org/finaf>" clause in
> the first query is, to me, saying the same thing as the GRAPH clause:
> just target triples in this particular graph. Is there a missed
> opportunity for some optimization here? Why is FROM (much) worse than
> GRAPH?
>
> I also wonder why TDB2 is so much slower than TDB1, especially for the
> first version of the query. It should be an improvement, right? Of
> course there are trade-offs in implementing any complex system. But it
> makes me think whether we should stick to TDB1 for the time being, as
> there are no obvious benefits in using TDB2 for our current use.
>
> Likewise, this makes me wonder whether there has been a mild decrease in
> performance between Jena 3.8.0 and 3.16.0 - though I didn't look at
> intermediate versions to pinpoint the exact change (or several) that
> would be causing the slowdown. If there's interest, I can try other
> versions as well.
>
> For now we will probably just change Skosmos to use the GRAPH variant of
> the query, which should fix the immediate problems with timeouts.
> Unfortunately I don't have the skills to work directly on the ARQ
> optimizer or TDB2 code bases. But I'd be happy to test other variations
> and potential fixes to these performance problems.
>
> Cheers,
> Osma
>
> [1] https://finto.fi/rest/v1/finaf/data?format=text/turtle
>
>