You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Osma Suominen <os...@helsinki.fi> on 2020/12/01 14:50:48 UTC

Re: Performance regressions in Jena and TDB2

Hi Andy!

Thanks again for your insight on this.

Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
> This is misused of FROM.
> 
> Using FROM is not as simple as just selection one graph from the 
> dataset. It is creating a complete new dataset comprised of graph picked 
> out of a pool of graph in the TDB database.
> 
> The app should use GRAPH if that is what it means!

It's a bit harsh to call this "misuse" of FROM, but I see your point. 
I've been using SPARQL for over 10 years and I still get tripped up by 
things that work (or not) in surprising ways...

Anyway I created a Skosmos PR [1] that switches this query to use GRAPH, 
so the performance issue goes away. We may have to look at other queries 
as well to see if they can be optimized by avoiding FROM - especially 
now that we use TDB2 which seems to be more sensitive about this.

> FROM is more.
> FROM can be several graphs
> 
> SELECT *
> FROM <g1>
> FROM <g2>
> WHERE ..
> 
> meaning that the default graph is the union view of g1 and g2.  And also 
> that there are no named graphs (no use of FROM NAMED) so using GRAPH in 
> the query will see nothing.

 From a naive user perspective, here are a few reasons to prefer FROM 
over GRAPH:

1. The syntax is more convenient and minimal. To target a specific 
graph, add a FROM clause; to target the default graph, just remove it 
(or comment out). GRAPH is more cumbersome since it adds another nesting 
layer.

2. It's more versatile, as you say (though in this case the query didn't 
really make use of the additional flexibility)

3. It looks more familiar if you know SQL.

> As to why there is a big difference - I don't know but the right thing 
> is not to use FROM unless you want the union and hiding effects. It's 
> about features not performance (the low level optimizer is being 
> partially blocked - no quads).

I see, that's good to know - but IMHO not obvious just reading the 
SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].

Just out of curiosity, I loaded the same data set into an RDF4J native 
store (default settings) and tested both the FROM and GRAPH variants of 
the query. There was no measurable difference in execution time: both 
queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with 
GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J 
doesn't have such a big difference between FROM and GRAPH (but this 
observation is based only on a single data point!).

> TDB2 is not about being faster than TDB1 - it's about better scale of 
> transactions, more robust and better handling of inline datatypes.
> The TDB2 bulk loader is faster ... with enough hardware much faster 
> (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster for 
> multibillions datasets.
> 
> Query-wise they are comparable.
> 
> There are two possible effects:
> 
> If you did a straight bulk load of the two datasets then TDB1 is going 
> to be a little faster because the dataset has a slightly better layout 
> initially. However, if any updates happen, the difference will fade - 
> it's also a reflection of the data base size (470Mi) and OS caching.

Understood. I also agree about TDB2 having a better design and 
especially the improved transaction support is valuable. I don't think 
we want to go back to TDB1.

> It would be interesting to see the query figures if you loaded as 
> triples - the triples table is smaller than the quads table with one graph.

Tested this briefly using Jena 3.16.0 only. I loaded the file into the 
default graph of both a TDB1 and a TDB2 store, and timed the query which 
of course had neither FROM nor GRAPH. I noticed that there was quite a 
bit of variance in the timings I did yesterday so I went for another 
measurement approach this time - I used 10 repetitions (without warmup) 
and wrote down the minimum time from the 10 query executions.

TDB1: 0.825s
TDB2: 1.127s

No real difference to the quad version using GRAPH (see below).

> Another effect is that DB2 has a better internal design with more 
> abstraction and classes. It may take more time for the JIT to fully 
> optimize.

When using --repeat 10, the first query takes much longer but after that 
there is no obvious pattern for the next queries. Some are faster than 
others, but the second one is already as fast as any of the others.

>> Likewise, this makes me wonder whether there has been a mild decrease 
>> in performance between Jena 3.8.0 and 3.16.0 - though I didn't look at 
>> intermediate versions to pinpoint the exact change (or several) that 
>> would be causing the slowdown. If there's interest, I can try other 
>> versions as well. 
> 
> Yes please. That would be useful to know.

OK, I did some additional measurements. I downloaded all releases from 
3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on 
both TDB1 and TDB2 stores. Again this is the minimum time for 10 
repetitions.

Sidenote: It's quite difficult to do accurate benchmarking on a laptop. 
I tried to eliminate interference from other applications, but there are 
still aspects that are hard to control such as CPU frequency, which may 
vary e.g. by temperature.

Here are the results:

TDB1

3.8.0: 0.904s
3.9.0: 0.901s
3.10.0: 0.874s
3.11.0: 0.906s
3.12.0: 0.871s
3.13.0: 0.903s
3.14.0: 0.917s
3.15.0: 0.916s
3.16.0: 0.892s


TDB2

3.8.0: 1.300s
3.9.0: 1.169s
3.10.0: 1.319s
3.11.0: 1.147s
3.12.0: 1.264s
3.13.0: 1.108s
3.14.0: 1.158s
3.15.0: 1.323s
3.16.0: 1.285s

There is no clear difference between versions this time (though TDB2 is 
a bit slower overall than TDB1). It looks like the 40% decrease in 
performance between 3.8.0 and 3.16.0 that I reported yesterday was a 
measurement error; though it's conceivable that the average query time 
has increased even if the minimum times have not. Nevertheless I don't 
think this requires further study.

> PS 1991-27-59 is not a date :-)

I know :) This is what happens when you convert 200k MARC records from a 
legacy system into RDF - errors in the data that nobody noticed before 
will suddenly crop up. There is also a bad URI value starting with a 
colon, though Jena doesn't complain loudly about it (RDF4J chokes 
though). These have already been fixed in the original records, so they 
will be corrected soon in the RDF data set as well when we regenerate it 
in the next few days.

-Osma

[1] https://github.com/NatLibFi/Skosmos/pull/1098

[2] https://jena.apache.org/tutorials/sparql_datasets.html



-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Osma Suominen <os...@helsinki.fi>.

Replying to myself, as I did some follow-up tests.

Osma Suominen kirjoitti 4.12.2020 klo 18.42:
> Now this turned into a rather interesting exercise in using git bisect. 
> I was able to track down the change that caused the slowdown. It's this 
> merge commit:
> 
> [f93fdbad7aa8d6ddb46693395e3bfb5ea487bf16] JENA-1648: Merge commit 
> 'refs/pull/507/head' of https://github.com/apache/jena
> 
> which refers to this pull request:
> 
> https://github.com/apache/jena/pull/507
> 
> I don't have time for very deep analysis right now but it doesn't 
> surprise me that a substantial change to the query result serialization 
> slows down the queries.
> 
> Things to check: (mostly as a TODO list for myself)
> 
> 1. Does this depend on the query result format? For example, is only the 
> text format (default) slower than before?
> 2. Is there something suspicious in the PR 507 code that would explain 
> why it's so much slower?

This affects at least the CSV format too, so it's not just the text 
output format.

But I figured out that the real change here is simply that the warmup 
performed when using the --repeat parameter with two arguments has 
become less effective starting with Jena 3.10.0. When no warmup is used, 
the performance is the same for the different Jena versions.

And now that Andy implemented JENA-2007 which improves the warmup, I 
think the problem has already been solved.

Case closed.

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Andy!

Andy Seaborne kirjoitti 2.12.2020 klo 13.01:
>>> There is no reason I can see why the special case of exactly one 
>>> "FROM" can't be handled specially. It masks all named graphs but is a 
>>> rewrite from triples, that will be fine.
>> Right. Is this worth opening a JIRA issue?
> If you want. It doesn't make it happen though; taht takes coding.

Understood. I won't open an issue right now but maybe later. But I doubt 
I'll be able to implement the change, at least not without some hints 
and perhaps substantial help.

> I replaced the SELECT line with a COUNT to see where the time is going.

OK.

>> Now there is a clear pattern: starting from Jena 3.10.0, the first 
>> timed run is much slower. So something happened there that makes the 
>> first full query (after the partial warmup) take much longer than it 
>> used to - but apparently not the subsequent ones which I timed yesterday.
>>
>> Any ideas what this could be about? Should this be investigated more?
> > It would be great if you would.

Now this turned into a rather interesting exercise in using git bisect. 
I was able to track down the change that caused the slowdown. It's this 
merge commit:

[f93fdbad7aa8d6ddb46693395e3bfb5ea487bf16] JENA-1648: Merge commit 
'refs/pull/507/head' of https://github.com/apache/jena

which refers to this pull request:

https://github.com/apache/jena/pull/507

I don't have time for very deep analysis right now but it doesn't 
surprise me that a substantial change to the query result serialization 
slows down the queries.

Things to check: (mostly as a TODO list for myself)

1. Does this depend on the query result format? For example, is only the 
text format (default) slower than before?
2. Is there something suspicious in the PR 507 code that would explain 
why it's so much slower?

I have no idea right now how significant this performance regression is. 
Judging from the fact that nobody has complained about it in almost 2 
years, it doesn't seem to be very bad.

> I have a change lined up to fix "--out none" to be "less optimized"; it 
> currently does nothing (very efficiently), it could consume silently the 
> query results.
> 
> It would also be better if the warmup with writing the required format 
> to /dev/null would also be better.

I see that you already did this - great work!

-Osma


-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 26 (Kaikukatu 4)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Andy Seaborne <an...@apache.org>.


On 02/12/2020 10:07, Osma Suominen wrote:
> Hi Andy!
> 
> Andy Seaborne kirjoitti 1.12.2020 klo 23.15:
>> There is no reason I can see why the special case of exactly one 
>> "FROM" can't be handled specially. It masks all named graphs but is a 
>> rewrite from triples, that will be fine.
> 
> Right. Is this worth opening a JIRA issue?

If you want. It doesn't make it happen though; taht takes coding.

> 
>>> TDB1: 0.825s
>>> TDB2: 1.127s
>>>
>>> No real difference to the quad version using GRAPH (see below).
>>
>> For me:
>> About 300-400ms is going in the SELECT expression.
> 
> Do you mean that it would be possible to further optimize the query 
> while getting the same result? I tested moving the SELECT expression 
> inside the query as a BIND, but it didn't seem to make any difference.

I replaced the SELECT line with a COUNT to see where the time is going.

>> The query timing has a bug - it does not preload the result set 
>> formatting during the warmup so classloading (and JIT but that is a 
>> lesser issue) happen during the main execution. Approximately, junk 
>> the first timed run.
>>
>> The rest go up and down because :
>> (1) other stuff on the machine going tick
>> (2) bits of successive JIT happening
> 
> That gave me an idea. I was wondering why there was such a clear 
> difference between Jena 3.8.0 and 3.16.0 in the initial benchmark (where 
> I used average query times after the warmup), but no difference in the 
> second one (where I used minimum query times).
> 
> I decided to look at the time for the first query after the partial 
> warmup - when you said "junk the first timed run" I did exactly the 
> opposite :) So I benchmarked the GRAPH variant of the query, on both 
> TDB1 and TDB2, using all Jena releases from 3.8.0 to 3.16.0. I used 
> --repeat 1,1 and ran all the queries 5 times as separate invocations, 
> noting the minimum time. The results:

Lots of class loading and JIT ...

> 
> TDB1
> 
> 3.8.0: 1.0s
> 3.9.0: 1.0s
> 3.10.0: 3.5s
> 3.11.0: 3.4s
> 3.12.0: 3.6s
> 3.13.0: 3.2s
> 3.14.0: 3.2s
> 3.15.0: 3.2s
> 3.16.0: 3.1s
> 
> TDB2
> 
> 3.8.0: 1.4s
> 3.9.0: 1.4s
> 3.10.0: 5.4s
> 3.11.0: 5.7s
> 3.12.0: 5.6s
> 3.13.0: 5.9s
> 3.14.0: 5.4s
> 3.15.0: 5.6s
> 3.16.0: 5.7s
> 
> Now there is a clear pattern: starting from Jena 3.10.0, the first timed 
> run is much slower. So something happened there that makes the first 
> full query (after the partial warmup) take much longer than it used to - 
> but apparently not the subsequent ones which I timed yesterday.
> 
> Any ideas what this could be about? Should this be investigated more?

It would be great if you would.

I have a change lined up to fix "--out none" to be "less optimized"; it 
currently does nothing (very efficiently), it could consume silently the 
query results.

It would also be better if the warmup with writing the required format 
to /dev/null would also be better.

     Andy

> I 
> looked at the 3.10.0 release notes but I couldn't see anything obviously 
> related. Of course the release happened almost two years ago, so it's 
> been a while...
> 
> -Osma
>

Re: Performance regressions in Jena and TDB2

Posted by Osma Suominen <os...@helsinki.fi>.

Hi Andy!

Andy Seaborne kirjoitti 1.12.2020 klo 23.15:
> There is no reason I can see why the special case of exactly one "FROM" 
> can't be handled specially. It masks all named graphs but is a rewrite 
> from triples, that will be fine.

Right. Is this worth opening a JIRA issue?

>> TDB1: 0.825s
>> TDB2: 1.127s
>>
>> No real difference to the quad version using GRAPH (see below).
> 
> For me:
> About 300-400ms is going in the SELECT expression.

Do you mean that it would be possible to further optimize the query 
while getting the same result? I tested moving the SELECT expression 
inside the query as a BIND, but it didn't seem to make any difference.

> The query timing has a bug - it does not preload the result set 
> formatting during the warmup so classloading (and JIT but that is a 
> lesser issue) happen during the main execution. Approximately, junk the 
> first timed run.
> 
> The rest go up and down because :
> (1) other stuff on the machine going tick
> (2) bits of successive JIT happening

That gave me an idea. I was wondering why there was such a clear 
difference between Jena 3.8.0 and 3.16.0 in the initial benchmark (where 
I used average query times after the warmup), but no difference in the 
second one (where I used minimum query times).

I decided to look at the time for the first query after the partial 
warmup - when you said "junk the first timed run" I did exactly the 
opposite :) So I benchmarked the GRAPH variant of the query, on both 
TDB1 and TDB2, using all Jena releases from 3.8.0 to 3.16.0. I used 
--repeat 1,1 and ran all the queries 5 times as separate invocations, 
noting the minimum time. The results:

TDB1

3.8.0: 1.0s
3.9.0: 1.0s
3.10.0: 3.5s
3.11.0: 3.4s
3.12.0: 3.6s
3.13.0: 3.2s
3.14.0: 3.2s
3.15.0: 3.2s
3.16.0: 3.1s

TDB2

3.8.0: 1.4s
3.9.0: 1.4s
3.10.0: 5.4s
3.11.0: 5.7s
3.12.0: 5.6s
3.13.0: 5.9s
3.14.0: 5.4s
3.15.0: 5.6s
3.16.0: 5.7s

Now there is a clear pattern: starting from Jena 3.10.0, the first timed 
run is much slower. So something happened there that makes the first 
full query (after the partial warmup) take much longer than it used to - 
but apparently not the subsequent ones which I timed yesterday.

Any ideas what this could be about? Should this be investigated more? I 
looked at the 3.10.0 release notes but I couldn't see anything obviously 
related. Of course the release happened almost two years ago, so it's 
been a while...

-Osma

-- 
Osma Suominen
D.Sc. (Tech), Information Systems Specialist
National Library of Finland
P.O. Box 15 (Unioninkatu 36)
00014 HELSINGIN YLIOPISTO
Tel. +358 50 3199529
osma.suominen@helsinki.fi
http://www.nationallibrary.fi

Re: Performance regressions in Jena and TDB2

Posted by Andy Seaborne <an...@apache.org>.


On 01/12/2020 14:50, Osma Suominen wrote:
> Hi Andy!
> 
> Thanks again for your insight on this.
> 
> Andy Seaborne kirjoitti 30.11.2020 klo 22.49:
>> This is misused of FROM.
>>
>> Using FROM is not as simple as just selection one graph from the 
>> dataset. It is creating a complete new dataset comprised of graph 
>> picked out of a pool of graph in the TDB database.
>>
>> The app should use GRAPH if that is what it means!
> 
> It's a bit harsh to call this "misuse" of FROM, but I see your point. 
> I've been using SPARQL for over 10 years and I still get tripped up by 
> things that work (or not) in surprising ways...
> 
> Anyway I created a Skosmos PR [1] that switches this query to use GRAPH, 
> so the performance issue goes away. We may have to look at other queries 
> as well to see if they can be optimized by avoiding FROM - especially 
> now that we use TDB2 which seems to be more sensitive about this.

There is no reason I can see why the special case of exactly one "FROM" 
can't be handled specially. It masks all named graphs but is a rewrite 
from triples, that will be fine.

>> FROM is more.
>> FROM can be several graphs
>>
>> SELECT *
>> FROM <g1>
>> FROM <g2>
>> WHERE ..
>>
>> meaning that the default graph is the union view of g1 and g2.  And 
>> also that there are no named graphs (no use of FROM NAMED) so using 
>> GRAPH in the query will see nothing.
> 
>  From a naive user perspective, here are a few reasons to prefer FROM 
> over GRAPH:
> 
> 1. The syntax is more convenient and minimal. To target a specific 
> graph, add a FROM clause; to target the default graph, just remove it 
> (or comment out). GRAPH is more cumbersome since it adds another nesting 
> layer.
> 
> 2. It's more versatile, as you say (though in this case the query didn't 
> really make use of the additional flexibility)
> 
> 3. It looks more familiar if you know SQL.
> 
>> As to why there is a big difference - I don't know but the right thing 
>> is not to use FROM unless you want the union and hiding effects. It's 
>> about features not performance (the low level optimizer is being 
>> partially blocked - no quads).
> 
> I see, that's good to know - but IMHO not obvious just reading the 
> SPARQL query spec or e.g. the Jena tutorial on SPARQL datasets [2].
> 
> Just out of curiosity, I loaded the same data set into an RDF4J native 
> store (default settings) and tested both the FROM and GRAPH variants of 
> the query. There was no measurable difference in execution time: both 
> queries took 3.3 seconds. So this is slower than TDB1, and TDB2 with 
> GRAPH, but a lot faster than TDB2 for the FROM variant. Apparently RDF4J 
> doesn't have such a big difference between FROM and GRAPH (but this 
> observation is based only on a single data point!).

RDF4J has a different graph storage. In a way, it always "FROM" like. I 
don't know if it is the same full FROM/FROM NAMED feature.

The spec is carefully worded by design to allow the different systems, 
RDF4J included to operate in a compliant manner. Systems already had 
different behaviours.

> 
>> TDB2 is not about being faster than TDB1 - it's about better scale of 
>> transactions, more robust and better handling of inline datatypes.
>> The TDB2 bulk loader is faster ... with enough hardware much faster 
>> (--loader=parallel) at 100's millions. TDB1 tdbloader2 can be faster 
>> for multibillions datasets.
>>
>> Query-wise they are comparable.
>>
>> There are two possible effects:
>>
>> If you did a straight bulk load of the two datasets then TDB1 is going 
>> to be a little faster because the dataset has a slightly better layout 
>> initially. However, if any updates happen, the difference will fade - 
>> it's also a reflection of the data base size (470Mi) and OS caching.
> 
> Understood. I also agree about TDB2 having a better design and 
> especially the improved transaction support is valuable. I don't think 
> we want to go back to TDB1.
> 
>> It would be interesting to see the query figures if you loaded as 
>> triples - the triples table is smaller than the quads table with one 
>> graph.
> 
> Tested this briefly using Jena 3.16.0 only. I loaded the file into the 
> default graph of both a TDB1 and a TDB2 store, and timed the query which 
> of course had neither FROM nor GRAPH. I noticed that there was quite a 
> bit of variance in the timings I did yesterday so I went for another 
> measurement approach this time - I used 10 repetitions (without warmup) 
> and wrote down the minimum time from the 10 query executions.
> 
> TDB1: 0.825s
> TDB2: 1.127s
> 
> No real difference to the quad version using GRAPH (see below).

For me:
About 300-400ms is going in the SELECT expression.

> 
>> Another effect is that DB2 has a better internal design with more 
>> abstraction and classes. It may take more time for the JIT to fully 
>> optimize.
> 
> When using --repeat 10, the first query takes much longer but after that 
> there is no obvious pattern for the next queries. Some are faster than 
> others, but the second one is already as fast as any of the others.

The query timing has a bug - it does not preload the result set 
formatting during the warmup so classloading (and JIT but that is a 
lesser issue) happen during the main execution. Approximately, junk the 
first timed run.

The rest go up and down because :
(1) other stuff on the machine going tick
(2) bits of successive JIT happening

> 
>>> Likewise, this makes me wonder whether there has been a mild decrease 
>>> in performance between Jena 3.8.0 and 3.16.0 - though I didn't look 
>>> at intermediate versions to pinpoint the exact change (or several) 
>>> that would be causing the slowdown. If there's interest, I can try 
>>> other versions as well. 
>>
>> Yes please. That would be useful to know.
> 
> OK, I did some additional measurements. I downloaded all releases from 
> 3.8.0 up to 3.16.0 and benchmarked the GRAPH version of the query on 
> both TDB1 and TDB2 stores. Again this is the minimum time for 10 
> repetitions.
> 
> Sidenote: It's quite difficult to do accurate benchmarking on a laptop. 

Yes, isn't it :-)

There are other application both competign for RAM and also waking up in 
the background (Chrome! Thunderbird!), and the OS doing file system 
cache maintenance.

> I tried to eliminate interference from other applications, but there are 
> still aspects that are hard to control such as CPU frequency, which may 
> vary e.g. by temperature.
> 
> Here are the results:
> 
> TDB1
> 
> 3.8.0: 0.904s
> 3.9.0: 0.901s
> 3.10.0: 0.874s
> 3.11.0: 0.906s
> 3.12.0: 0.871s
> 3.13.0: 0.903s
> 3.14.0: 0.917s
> 3.15.0: 0.916s
> 3.16.0: 0.892s
> 
> 
> TDB2
> 
> 3.8.0: 1.300s
> 3.9.0: 1.169s
> 3.10.0: 1.319s
> 3.11.0: 1.147s
> 3.12.0: 1.264s
> 3.13.0: 1.108s
> 3.14.0: 1.158s
> 3.15.0: 1.323s
> 3.16.0: 1.285s
> 
> There is no clear difference between versions this time (though TDB2 is 
> a bit slower overall than TDB1). It looks like the 40% decrease in 
> performance between 3.8.0 and 3.16.0 that I reported yesterday was a 
> measurement error; though it's conceivable that the average query time 
> has increased even if the minimum times have not. Nevertheless I don't 
> think this requires further study.
> 
>> PS 1991-27-59 is not a date :-)
> 
> I know :) This is what happens when you convert 200k MARC records from a 
> legacy system into RDF - errors in the data that nobody noticed before 
> will suddenly crop up. There is also a bad URI value starting with a 
> colon, though Jena doesn't complain loudly about it (RDF4J chokes 
> though). These have already been fixed in the original records, so they 
> will be corrected soon in the RDF data set as well when we regenerate it 
> in the next few days.
> 
> -Osma
> 
> [1] https://github.com/NatLibFi/Skosmos/pull/1098
> 
> [2] https://jena.apache.org/tutorials/sparql_datasets.html
> 
> 
>