Posted to users@jena.apache.org by Bill Roberts <bi...@swirrl.com> on 2011/04/01 22:17:44 UTC
Fwd: Problems with TDB Optimizer
See below - sorry, realised this message more appropriate for jena-users than jena-dev
Begin forwarded message:
> From: Bill Roberts <bi...@swirrl.com>
> Date: 1 April 2011 19:44:20 GMT+01:00
> To: jena-dev@incubator.apache.org
> Bcc: Ric Roberts <ri...@swirrl.com>
> Subject: Problems with TDB Optimizer
>
> I've come across some unexpected (to me!) behaviour of the TDB Optimizer and wondering if someone could shed any light on it.
>
> For our database, (around 30 million triples, 350-odd different predicates, around 50 named graphs, using UnionDefaultGraph - everything is in a named graph), we've found that including the stats.opt file makes some queries significantly slower than having no optimizer.
>
> Some relatively complex queries run quite quickly and probably a bit quicker with optimization than without. But in other cases, quite simple queries run a lot slower - maybe 10 or 20 times slower with stats.opt in place than they do without it.
>
> Is this known behaviour?
>
> Here's an example:
>
> SELECT ?key ?label WHERE { <a-specific-uri> ?p ?key . ?key <http://www.w3.org/2000/01/rdf-schema#label> ?label }
>
> This query took around 30 seconds with stats.opt in place, and less than 2 seconds without it. (Some of that 2 seconds would have been HTTP transfer and web page rendering time).
>
> We're currently running TDB 0.8.9 and Joseki 3.4.3 on 64 bit Ubuntu. (Though I've found similar behaviour on 32-bit Ubuntu with slightly older versions of TDB and Joseki).
>
> Thanks!
>
> Bill
Re: Problems with TDB Optimizer
Posted by Andy Seaborne <an...@epimorphics.com>.
On 06/04/11 16:18, Bill Roberts wrote:
> Andy
>
> Sorry for slow response from me. I tried adding the line ((TERM ANY
> ANY) 10) to the end of the stats.opt file and that increased the
> speed of that query to about the same as with no optimiser. So
> thanks very much for that - it's kind of solved the problem, but I
> need to do some more tests on a broader range of queries, to find the
> cases where the optimiser is actively helping (as opposed to no
> longer slowing things down!)
I'd be very interested in what you find out. I recently realised why
the fixed optimizer does as well as it does - it has some built-in stats
model that is data independent, so on the surface it looks like it should
sometimes make bad choices. The fixed-ness is only used to choose the
starting point, after which it tends to execute connected triple patterns
and so avoids intermediate cross products, which are really bad.
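A minimal sketch of that ordering idea, as a toy model only and not TDB's actual code: pick the cheapest pattern as the starting point, then keep choosing patterns that share a variable with what is already bound. The position weights in fixed_weight, the example.org URIs and the third pattern are all made up for illustration.

```python
def is_var(term):
    """A term is a variable if it starts with '?'."""
    return term.startswith("?")

def variables(pattern):
    """Set of variables used in an (s, p, o) triple pattern."""
    return {t for t in pattern if is_var(t)}

def fixed_weight(pattern):
    # Hypothetical data-independent weights: a constant subject is assumed
    # to be most selective, then a constant object, then a constant predicate.
    s, p, o = pattern
    weight = 7
    if not is_var(s):
        weight -= 4
    if not is_var(p):
        weight -= 1
    if not is_var(o):
        weight -= 2
    return weight

def reorder(patterns, weight=fixed_weight):
    """Greedy order: cheapest start, then prefer connected patterns."""
    remaining = list(patterns)
    ordered, bound = [], set()
    while remaining:
        # Patterns sharing a variable with those already chosen.
        connected = [p for p in remaining if variables(p) & bound]
        # Fall back to everything only if nothing is connected
        # (the first step, or an unavoidable cross product).
        best = min(connected or remaining, key=weight)
        ordered.append(best)
        remaining.remove(best)
        bound |= variables(best)
    return ordered

start = ("<a-specific-uri>", "?p", "?key")
label = ("?key", "<http://www.w3.org/2000/01/rdf-schema#label>", "?label")
extra = ("?label", "<http://example.org/r>", "<http://example.org/o>")

# Sorting by weight alone puts `extra` second, although it shares no
# variable with `start`, which would force an intermediate cross product.
by_weight = sorted([label, extra, start], key=fixed_weight)

# The greedy order follows ?key and then ?label instead.
greedy = reorder([label, extra, start])
print(by_weight)
print(greedy)
```

In this toy the weights only decide where to start; after that, connectivity dominates, which is the behaviour described above.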
The stats format as generated by tdbstats or tdbloader2 is meant to be a
help, not a perfect stats file.
If you know <ifp> is an inverse functional property,
((ANY <ifp> TERM) 1)
is the right thing to add.
> Presumably it's cases where there are multiple clauses in the query
> and the order of evaluating them is significant.
Yes.
>
> Anyway, thanks a lot for your help.
>
> Cheers
>
> Bill
Andy
Re: Problems with TDB Optimizer
Posted by Bill Roberts <bi...@swirrl.com>.
Andy
Sorry for slow response from me. I tried adding the line ((TERM ANY ANY) 10) to the end of the stats.opt file and that increased the speed of that query to about the same as with no optimiser. So thanks very much for that - it's kind of solved the problem, but I need to do some more tests on a broader range of queries, to find the cases where the optimiser is actively helping (as opposed to no longer slowing things down!)
Presumably it's cases where there are multiple clauses in the query and the order of evaluating them is significant.
Anyway, thanks a lot for your help.
Cheers
Bill
On 4 Apr 2011, at 15:47, Andy Seaborne wrote:
> Hi Bill,
>
> The stats optimizer is preferring the property it knows about to the variable ?p it does not. If you add a stats rule to the file to tell the optimizer about a constant subject triple pattern: e.g. at the end of the stats file you sent me, add a line "((TERM ANY ANY) 10)"
>
> ...
> (<http://education.data.gov.uk/...> 24336)
> ((TERM ANY ANY) 10)
> (other 0))
>
> 10 is a guess - lower numbers will increase the favour of the
> "<a-specific-uri> ?p ?key" part.
>
>
> Could you let me know if that changes things measurably on the real data?
>
> Maybe this ought to always go in the stats file. (That needs careful thought because if it's wrong, it's potentially a bit nasty.)
>
> Andy
Re: Fwd: Problems with TDB Optimizer
Posted by Andy Seaborne <an...@epimorphics.com>.
Hi Bill,
The stats optimizer is preferring the property it knows about to the
variable ?p it does not. If you add a stats rule to the file to tell
the optimizer about a constant subject triple pattern: e.g. at the end
of the stats file you sent me, add a line "((TERM ANY ANY) 10)"
...
(<http://education.data.gov.uk/...> 24336)
((TERM ANY ANY) 10)
(other 0))
10 is a guess - lower numbers will increase the favour of the
"<a-specific-uri> ?p ?key" part.
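To illustrate why the extra rule changes the order, here is a toy model of the rule lookup. It is a sketch under stated assumptions, not TDB's implementation: rules are tried in file order and the first match wins, a pattern with no matching rule is run after patterns with a known weight, a per-predicate count is simplified to an (ANY predicate ANY) rule, and the "(other 0)" entry is not modelled. The rdfs:label count of 1000000 is invented for the example.

```python
ANY, TERM, VAR = "ANY", "TERM", "VAR"
LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"

def is_var(term):
    """A term is a variable if it starts with '?'."""
    return term.startswith("?")

def slot_matches(rule_slot, term):
    """Match one slot of a rule against one term of a triple pattern."""
    if rule_slot == ANY:
        return True
    if rule_slot == VAR:
        return is_var(term)
    if rule_slot == TERM:
        return not is_var(term)
    return rule_slot == term  # a concrete URI must match exactly

def weight(rules, pattern):
    """Weight of the first rule that matches, or None if none does."""
    for rule_pattern, w in rules:
        if all(slot_matches(r, t) for r, t in zip(rule_pattern, pattern)):
            return w
    return None

def order(rules, patterns):
    """Known weights first (lowest first); unknown patterns last."""
    def key(pattern):
        w = weight(rules, pattern)
        return (w is None, 0 if w is None else w)
    return sorted(patterns, key=key)

subject_pattern = ("<a-specific-uri>", "?p", "?key")
label_pattern = ("?key", LABEL, "?label")

# Hypothetical stats: only the predicate count is known.
before = [((ANY, LABEL, ANY), 1_000_000)]
# The same stats with the suggested rule appended.
after = before + [((TERM, ANY, ANY), 10)]

# Without the rule the label pattern runs first, i.e. every label is scanned.
print(order(before, [subject_pattern, label_pattern]))
# With the rule the constant-subject pattern runs first.
print(order(after, [subject_pattern, label_pattern]))
```

Lowering the 10 further would only matter in this model once some other rule gave a competing pattern a weight below it.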
Could you let me know if that changes things measurably on the real data?
Maybe this ought to always go in the stats file. (That needs careful
thought because if it's wrong, it's potentially a bit nasty.)
Andy
On 02/04/11 19:58, Andy Seaborne wrote:
> Quick answer: longer to follow:
>
> Could you try using the "fixed.opt", removing "stats.opt" and let me
> know what happens?
>
> Andy
Re: Fwd: Problems with TDB Optimizer
Posted by Andy Seaborne <an...@epimorphics.com>.
Quick answer: longer to follow:
Could you try using the "fixed.opt", removing "stats.opt" and let me
know what happens?
Andy