You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2020/12/01 14:50:48 UTC
Re: Performance regressions in Jena and TDB2
Hi Andy!
Thanks again for your insight on this.
Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
> This is misused of FROM.
>
> Using FROM is not as simple as just selection one graph from the
> dataset. It is creating a complete new dataset comprised of graph picked
> out of a pool of graph in the TDB database.
>
> The app should use GRAPH if that is what it means!
It's a bit harsh to call this "misuse" of FROM, but I see your point.
I've been using SPARQL for over 10 years and I still get tripped up by
things that work (or not) in surprising ways...
Anyway I created a Skosmos PR [1] that switches this query to use GRAPH,
so the performance issue goes away. We may have to look at other queries
as well to see if they can be optimized by avoiding FROM - especially
now that we use TDB2 which seems to be more sensitive about this.
> FROM is more.
> FROM can be several graphs
>
> SELECT *
> FROM <g1>
> FROM <g2>
> WHERE ..
>
> meaning that the default graph is the union view of g1 and g2. And also
> that there are no named graphs (no use of FROM NAMED) so using GRAPH in
> the query will see nothing.
From a naive user perspective, here are a few reasons to prefer FROM
over GRAPH:
1. The syntax is more convenient and minimal. To target a specific
graph, add a FROM clause; to target the default graph, just remove it
(or comment out). GRAPH is more cumbersome since it adds another nesting
layer.
2. It's more versatile, as you say (though in this case the query didn't
really make use of the additional flexibility)
3. It looks more familiar if you know SQL.
> As to why there is a big difference - I don't know but the right thing
> is not to use FROM unless you want the union and hiding effects. It's
> about features not performance (the low level optimizer is being
> partially blocked - no quads).
I see, that's good to know - but IMHO not obvious just reading the
SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].
Just out of curiosity, I loaded the same data set into an RDF4J native
store (default settings) and tested both the FROM and GRAPH variants of
the query. There was no measurable difference in execution time: both
queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with
GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J
doesn't have such a big difference between FROM and GRAPH (but this
observation is based only on a single data point!).
> TDB2 is not about being faster than TDB1 - it's about better scale of
> transactions, more robust and better handling of inline datatypes.
> The TDB2 bulk loader is faster ... with enough hardware much faster
> (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster for
> multibillions datasets.
>
> Query-wise they are comparable.
>
> There are two possible effects:
>
> If you did a straight bulk load of the two datasets then TDB1 is going
> to be a little faster because the dataset has a slightly better layout
> initially. However, if any updates happen, the difference will fade -
> it's also a reflection of the data base size (470Mi) and OS caching.
Understood. I also agree about TDB2 having a better design and
especially the improved transaction support is valuable. I don't think
we want to go back to TDB1.
> It would be interesting to see the query figures if you loaded as
> triples - the triples table is smaller than the quads table with one graph.
Tested this briefly using Jena 3.16.0 only. I loaded the file into the
default graph of both a TDB1 and a TDB2 store, and timed the query which
of course had neither FROM nor GRAPH. I noticed that there was quite a
bit of variance in the timings I did yesterday so I went for another
measurement approach this time - I used 10 repetitions (without warmup)
and wrote down the minimum time from the 10 query executions.
TDB1: 0.825s
TDB2: 1.127s
No real difference to the quad version using GRAPH (see below).
> Another effect is that DB2 has a better internal design with more
> abstraction and classes. It may take more time for the JIT to fully
> optimize.
When using --repeat 10, the first query takes much longer but after that
there is no obvious pattern for the next queries. Some are faster than
others, but the second one is already as fast as any of the others.
>> Likewise, this makes me wonder whether there has been a mild decrease
>> in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at
>> intermediate versions to pinpoint the exact change (or several) that
>> would be causing the slowdown. If there's interest, I can try other
>> versions as well.
>
> Yes please. That would be useful to know.
OK, I did some additional measurements. I downloaded all releases from
3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on
both TDB1 and TDB2 stores. Again this is the minimum time for 10
repetitions.
Sidenote: It's quite difficult to do accurate benchmarking on a laptop.
I tried to eliminate interference from other applications, but there are
still aspects that are hard to control such as CPU frequency, which may
vary e.g. by temperature.
Here are the results:
TDB1
3.8.0: 0.904s
3.9.0: 0.901s
3.10.0: 0.874s
3.11.0: 0.906s
3.12.0: 0.871s
3.13.0: 0.903s
3.14.0: 0.917s
3.15.0: 0.916s
3.16.0: 0.892s
TDB2
3.8.0: 1.300s
3.9.0: 1.169s
3.10.0: 1.319s
3.11.0: 1.147s
3.12.0: 1.264s
3.13.0: 1.108s
3.14.0: 1.158s
3.15.0: 1.323s
3.16.0: 1.285s
There is no clear difference between versions this time (though TDB2 is
a bit slower overall than TDB1). It looks like the 40% decrease in
performance between 3.8.0 and 3.16.0 that I reported yesterday was a
measurement error; though it's conceivable that the average query time
has increased even if the minimum times have not. Nevertheless I don't
think this requires further study.
> PS 1991-27-59 is not a date :-)
I know :) This is what happens when you convert 200k MARC records from a
legacy system into RDF - errors in the data that nobody noticed before
will suddenly crop up. There is also a bad URI value starting with a
colon, though Jena doesn't complain loudly about it (RDF4J chokes
though). These have already been fixed in the original records, so they
will be corrected soon in the RDF data set as well when we regenerate it
in the next few days.
-Osma
[1] https://github.com/NatLibFi/Skosmos/pull/1098
[2] https://jena.apache.org/tutorials/sparql_datasets.html
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Osma Suominen <os...@helsinki.fi>.
Replying to myself, as I did some follow-up tests.
Osma Suominen kirjoitti 4.12.2020 klo 18.42:
> Now this turned into a rather interesting exercise in using git bisect.
> I was able to track down the change that caused the slowdown. It's this
> merge commit:
>
> [f93fdbad7aa8d6ddb46693395e3bfb5ea487bf16] JENA-1648: Merge commit
> 'refs/pull/507/head' of https://github.com/apache/jena
>
> which refers to this pull request:
>
> https://github.com/apache/jena/pull/507
>
> I don't have time for very deep analysis right now but it doesn't
> surprise me that a substantial change to the query result serialization
> slows down the queries.
>
> Things to check: (mostly as a TODO list for myself)
>
> 1. Does this depend on the query result format? For example, is only the
> text format (default) slower than before?
> 2. Is there something suspicious in the PR 507 code that would explain
> why it's so much slower?
This affects at least the CSV format too, so it's not just the text
output format.
But I figured out that the real change here is simply that the warmup
performed when using the --repeat parameter with two arguments has
become less effective starting with Jena 3.10.0. When no warmup is used,
the performance is the same for the different Jena versions.
And now that Andy implemented JENA-2007 which improves the warmup, I
think the problem has already been solved.
Case closed.
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!
Andy Seaborne kirjoitti 2.12.2020 klo 13.01:
>>> There is no reason I can see why the special case of exactly one
>>> "FROM" can't be handled specially. It masks all named graphs but is a
>>> rewrite from triples, that will be fine.
>> Right. Is this worth opening a JIRA issue?
> If you want. It doesn't make it happen though; taht takes coding.
Understood. I won't open an issue right now but maybe later. But I doubt
I'll be able to implement the change, at least not without some hints
and perhaps substantial help.
> I replaced the SELECT line with a COUNT to see where the time is going.
OK.
>> Now there is a clear pattern: starting from Jena 3.10.0, the first
>> timed run is much slower. So something happened there that makes the
>> first full query (after the partial warmup) take much longer than it
>> used to - but apparently not the subsequent ones which I timed yesterday.
>>
>> Any ideas what this could be about? Should this be investigated more?
> > It would be great if you would.
Now this turned into a rather interesting exercise in using git bisect.
I was able to track down the change that caused the slowdown. It's this
merge commit:
[f93fdbad7aa8d6ddb46693395e3bfb5ea487bf16] JENA-1648: Merge commit
'refs/pull/507/head' of https://github.com/apache/jena
which refers to this pull request:
https://github.com/apache/jena/pull/507
I don't have time for very deep analysis right now but it doesn't
surprise me that a substantial change to the query result serialization
slows down the queries.
Things to check: (mostly as a TODO list for myself)
1. Does this depend on the query result format? For example, is only the
text format (default) slower than before?
2. Is there something suspicious in the PR 507 code that would explain
why it's so much slower?
I have no idea right now how significant this performance regression is.
Judging from the fact that nobody has complained about it in almost 2
years, it doesn't seem to be very bad.
> I have a change lined up to fix "--out none" to be "less optimized"; it
> currently does nothing (very efficiently), it could consume silently the
> query results.
>
> It would also be better if the warmup with writing the required format
> to /dev/null would also be better.
I see that you already did this - great work!
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Andy Seaborne <an...@apache.org>.
On 02/12/2020 10:07, Osma Suominen wrote:
> Hi Andy!
>
> Andy Seaborne kirjoitti 1.12.2020 klo 23.15:
>> There is no reason I can see why the special case of exactly one
>> "FROM" can't be handled specially. It masks all named graphs but is a
>> rewrite from triples, that will be fine.
>
> Right. Is this worth opening a JIRA issue?
If you want. It doesn't make it happen though; taht takes coding.
>
>>> TDB1: 0.825s
>>> TDB2: 1.127s
>>>
>>> No real difference to the quad version using GRAPH (see below).
>>
>> For me:
>> About 300-400ms is going in the SELECT expression.
>
> Do you mean that it would be possible to further optimize the query
> while getting the same result? I tested moving the SELECT expression
> inside the query as a BIND, but it didn't seem to make any difference.
I replaced the SELECT line with a COUNT to see where the time is going.
>> The query timing has a bug - it does not preload the result set
>> formatting during the warmup so classloading (and JIT but that is a
>> lesser issue) happen during the main execution. Approximately, junk
>> the first timed run.
>>
>> The rest go up and down because :
>> (1) other stuff on the machine going tick
>> (2) bits of successive JIT happening
>
> That gave me an idea. I was wondering why there was such a clear
> difference between Jena 3.8.0 and 3.16.0 in the initial benchmark (where
> I used average query times after the warmup), but no difference in the
> second one (where I used minimum query times).
>
> I decided to look at the time for the first query after the partial
> warmup - when you said "junk the first timed run" I did exactly the
> opposite :) So I benchmarked the GRAPH variant of the query, on both
> TDB1 and TDB2, using all Jena releases from 3.8.0 to 3.16.0. I used
> --repeat 1,1 and ran all the queries 5 times as separate invocations,
> noting the minimum time. The results:
Lots of class loading and JIT ...
>
> TDB1
>
> 3.8.0: 1.0s
> 3.9.0: 1.0s
> 3.10.0: 3.5s
> 3.11.0: 3.4s
> 3.12.0: 3.6s
> 3.13.0: 3.2s
> 3.14.0: 3.2s
> 3.15.0: 3.2s
> 3.16.0: 3.1s
>
> TDB2
>
> 3.8.0: 1.4s
> 3.9.0: 1.4s
> 3.10.0: 5.4s
> 3.11.0: 5.7s
> 3.12.0: 5.6s
> 3.13.0: 5.9s
> 3.14.0: 5.4s
> 3.15.0: 5.6s
> 3.16.0: 5.7s
>
> Now there is a clear pattern: starting from Jena 3.10.0, the first timed
> run is much slower. So something happened there that makes the first
> full query (after the partial warmup) take much longer than it used to -
> but apparently not the subsequent ones which I timed yesterday.
>
> Any ideas what this could be about? Should this be investigated more?
It would be great if you would.
I have a change lined up to fix "--out none" to be "less optimized"; it
currently does nothing (very efficiently), it could consume silently the
query results.
It would also be better if the warmup with writing the required format
to /dev/null would also be better.
Andy
> I
> looked at the 3.10.0 release notes but I couldn't see anything obviously
> related. Of course the release happened almost two years ago, so it's
> been a while...
>
> -Osma
>
Re: Performance regressions in Jena and TDB2
Posted by Osma Suominen <os...@helsinki.fi>.
Hi Andy!
Andy Seaborne kirjoitti 1.12.2020 klo 23.15:
> There is no reason I can see why the special case of exactly one "FROM"
> can't be handled specially. It masks all named graphs but is a rewrite
> from triples, that will be fine.
Right. Is this worth opening a JIRA issue?
>> TDB1: 0.825s
>> TDB2: 1.127s
>>
>> No real difference to the quad version using GRAPH (see below).
>
> For me:
> About 300-400ms is going in the SELECT expression.
Do you mean that it would be possible to further optimize the query
while getting the same result? I tested moving the SELECT expression
inside the query as a BIND, but it didn't seem to make any difference.
> The query timing has a bug - it does not preload the result set
> formatting during the warmup so classloading (and JIT but that is a
> lesser issue) happen during the main execution. Approximately, junk the
> first timed run.
>
> The rest go up and down because :
> (1) other stuff on the machine going tick
> (2) bits of successive JIT happening
That gave me an idea. I was wondering why there was such a clear
difference between Jena 3.8.0 and 3.16.0 in the initial benchmark (where
I used average query times after the warmup), but no difference in the
second one (where I used minimum query times).
I decided to look at the time for the first query after the partial
warmup - when you said "junk the first timed run" I did exactly the
opposite :) So I benchmarked the GRAPH variant of the query, on both
TDB1 and TDB2, using all Jena releases from 3.8.0 to 3.16.0. I used
--repeat 1,1 and ran all the queries 5 times as separate invocations,
noting the minimum time. The results:
TDB1
3.8.0: 1.0s
3.9.0: 1.0s
3.10.0: 3.5s
3.11.0: 3.4s
3.12.0: 3.6s
3.13.0: 3.2s
3.14.0: 3.2s
3.15.0: 3.2s
3.16.0: 3.1s
TDB2
3.8.0: 1.4s
3.9.0: 1.4s
3.10.0: 5.4s
3.11.0: 5.7s
3.12.0: 5.6s
3.13.0: 5.9s
3.14.0: 5.4s
3.15.0: 5.6s
3.16.0: 5.7s
Now there is a clear pattern: starting from Jena 3.10.0, the first timed
run is much slower. So something happened there that makes the first
full query (after the partial warmup) take much longer than it used to -
but apparently not the subsequent ones which I timed yesterday.
Any ideas what this could be about? Should this be investigated more? I
looked at the 3.10.0 release notes but I couldn't see anything obviously
related. Of course the release happened almost two years ago, so it's
been a while...
-Osma
--
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi
Re: Performance regressions in Jena and TDB2
Posted by Andy Seaborne <an...@apache.org>.
On 01/12/2020 14:50, Osma Suominen wrote:
> Hi Andy!
>
> Thanks again for your insight on this.
>
> Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
>> This is misused of FROM.
>>
>> Using FROM is not as simple as just selection one graph from the
>> dataset. It is creating a complete new dataset comprised of graph
>> picked out of a pool of graph in the TDB database.
>>
>> The app should use GRAPH if that is what it means!
>
> It's a bit harsh to call this "misuse" of FROM, but I see your point.
> I've been using SPARQL for over 10 years and I still get tripped up by
> things that work (or not) in surprising ways...
>
> Anyway I created a Skosmos PR [1] that switches this query to use GRAPH,
> so the performance issue goes away. We may have to look at other queries
> as well to see if they can be optimized by avoiding FROM - especially
> now that we use TDB2 which seems to be more sensitive about this.
There is no reason I can see why the special case of exactly one "FROM"
can't be handled specially. It masks all named graphs but is a rewrite
from triples, that will be fine.
>> FROM is more.
>> FROM can be several graphs
>>
>> SELECT *
>> FROM <g1>
>> FROM <g2>
>> WHERE ..
>>
>> meaning that the default graph is the union view of g1 and g2. And
>> also that there are no named graphs (no use of FROM NAMED) so using
>> GRAPH in the query will see nothing.
>
> From a naive user perspective, here are a few reasons to prefer FROM
> over GRAPH:
>
> 1. The syntax is more convenient and minimal. To target a specific
> graph, add a FROM clause; to target the default graph, just remove it
> (or comment out). GRAPH is more cumbersome since it adds another nesting
> layer.
>
> 2. It's more versatile, as you say (though in this case the query didn't
> really make use of the additional flexibility)
>
> 3. It looks more familiar if you know SQL.
>
>> As to why there is a big difference - I don't know but the right thing
>> is not to use FROM unless you want the union and hiding effects. It's
>> about features not performance (the low level optimizer is being
>> partially blocked - no quads).
>
> I see, that's good to know - but IMHO not obvious just reading the
> SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].
>
> Just out of curiosity, I loaded the same data set into an RDF4J native
> store (default settings) and tested both the FROM and GRAPH variants of
> the query. There was no measurable difference in execution time: both
> queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with
> GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J
> doesn't have such a big difference between FROM and GRAPH (but this
> observation is based only on a single data point!).
RDF4J has a different graph storage. In a way, it always "FROM" like. I
don't know if it is the same full FROM/FROM NAMED feature.
The spec is carefully worded by design to allow the different systems,
RDF4J included to operate in a compliant manner. Systems already had
different behaviours.
>
>> TDB2 is not about being faster than TDB1 - it's about better scale of
>> transactions, more robust and better handling of inline datatypes.
>> The TDB2 bulk loader is faster ... with enough hardware much faster
>> (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster
>> for multibillions datasets.
>>
>> Query-wise they are comparable.
>>
>> There are two possible effects:
>>
>> If you did a straight bulk load of the two datasets then TDB1 is going
>> to be a little faster because the dataset has a slightly better layout
>> initially. However, if any updates happen, the difference will fade -
>> it's also a reflection of the data base size (470Mi) and OS caching.
>
> Understood. I also agree about TDB2 having a better design and
> especially the improved transaction support is valuable. I don't think
> we want to go back to TDB1.
>
>> It would be interesting to see the query figures if you loaded as
>> triples - the triples table is smaller than the quads table with one
>> graph.
>
> Tested this briefly using Jena 3.16.0 only. I loaded the file into the
> default graph of both a TDB1 and a TDB2 store, and timed the query which
> of course had neither FROM nor GRAPH. I noticed that there was quite a
> bit of variance in the timings I did yesterday so I went for another
> measurement approach this time - I used 10 repetitions (without warmup)
> and wrote down the minimum time from the 10 query executions.
>
> TDB1: 0.825s
> TDB2: 1.127s
>
> No real difference to the quad version using GRAPH (see below).
For me:
About 300-400ms is going in the SELECT expression.
>
>> Another effect is that DB2 has a better internal design with more
>> abstraction and classes. It may take more time for the JIT to fully
>> optimize.
>
> When using --repeat 10, the first query takes much longer but after that
> there is no obvious pattern for the next queries. Some are faster than
> others, but the second one is already as fast as any of the others.
The query timing has a bug - it does not preload the result set
formatting during the warmup so classloading (and JIT but that is a
lesser issue) happen during the main execution. Approximately, junk the
first timed run.
The rest go up and down because :
(1) other stuff on the machine going tick
(2) bits of successive JIT happening
>
>>> Likewise, this makes me wonder whether there has been a mild decrease
>>> in performance between Jena 3.8.0 and 3.16.0 - though I didn't look
>>> at intermediate versions to pinpoint the exact change (or several)
>>> that would be causing the slowdown. If there's interest, I can try
>>> other versions as well.
>>
>> Yes please. That would be useful to know.
>
> OK, I did some additional measurements. I downloaded all releases from
> 3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on
> both TDB1 and TDB2 stores. Again this is the minimum time for 10
> repetitions.
>
> Sidenote: It's quite difficult to do accurate benchmarking on a laptop.
Yes, isn't it :-)
There are other application both competign for RAM and also waking up in
the background (Chrome! Thunderbird!), and the OS doing file system
cache maintenance.
> I tried to eliminate interference from other applications, but there are
> still aspects that are hard to control such as CPU frequency, which may
> vary e.g. by temperature.
>
> Here are the results:
>
> TDB1
>
> 3.8.0: 0.904s
> 3.9.0: 0.901s
> 3.10.0: 0.874s
> 3.11.0: 0.906s
> 3.12.0: 0.871s
> 3.13.0: 0.903s
> 3.14.0: 0.917s
> 3.15.0: 0.916s
> 3.16.0: 0.892s
>
>
> TDB2
>
> 3.8.0: 1.300s
> 3.9.0: 1.169s
> 3.10.0: 1.319s
> 3.11.0: 1.147s
> 3.12.0: 1.264s
> 3.13.0: 1.108s
> 3.14.0: 1.158s
> 3.15.0: 1.323s
> 3.16.0: 1.285s
>
> There is no clear difference between versions this time (though TDB2 is
> a bit slower overall than TDB1). It looks like the 40% decrease in
> performance between 3.8.0 and 3.16.0 that I reported yesterday was a
> measurement error; though it's conceivable that the average query time
> has increased even if the minimum times have not. Nevertheless I don't
> think this requires further study.
>
>> PS 1991-27-59 is not a date :-)
>
> I know :) This is what happens when you convert 200k MARC records from a
> legacy system into RDF - errors in the data that nobody noticed before
> will suddenly crop up. There is also a bad URI value starting with a
> colon, though Jena doesn't complain loudly about it (RDF4J chokes
> though). These have already been fixed in the original records, so they
> will be corrected soon in the RDF data set as well when we regenerate it
> in the next few days.
>
> -Osma
>
> [1] https://github.com/NatLibFi/Skosmos/pull/1098
>
> [2] https://jena.apache.org/tutorials/sparql_datasets.html
>
>
>