Posted to users@jena.apache.org by Bill Roberts <bi...@swirrl.com> on 2011/04/01 22:17:44 UTC

Fwd: Problems with TDB Optimizer

See below - sorry, I realised this message is more appropriate for jena-users than jena-dev.

Begin forwarded message:

> From: Bill Roberts <bi...@swirrl.com>
> Date: 1 April 2011 19:44:20 GMT+01:00
> To: jena-dev@incubator.apache.org
> Bcc: Ric Roberts <ri...@swirrl.com>
> Subject: Problems with TDB Optimizer
> 
> I've come across some unexpected (to me!) behaviour of the TDB Optimizer and am wondering if someone could shed any light on it.
> 
> For our database (around 30 million triples, 350-odd different predicates, around 50 named graphs, using UnionDefaultGraph - everything is in a named graph), we've found that including the stats.opt file makes some queries significantly slower than having no optimizer.
> 
> Some relatively complex queries run quite quickly, and probably a bit quicker with optimization than without.  But in other cases, quite simple queries run a lot slower - maybe 10 or 20 times slower with stats.opt in place than they do without it.
> 
> Is this known behaviour?
> 
> Here's an example: 
> 
> SELECT ?key ?label WHERE { <a-specific-uri> ?p ?key . ?key <http://www.w3.org/2000/01/rdf-schema#label> ?label }
> 
> This query took around 30 seconds with stats.opt in place, and less than 2 seconds without it. (Some of that 2 seconds would have been HTTP transfer and web page rendering time).
> 
> We're currently running TDB 0.8.9 and Joseki 3.4.3 on 64-bit Ubuntu.  (Though I've found similar behaviour on 32-bit Ubuntu with slightly older versions of TDB and Joseki).
> 
> Thanks!
> 
> Bill

Re: Problems with TDB Optimizer

Posted by Andy Seaborne <an...@epimorphics.com>.

On 06/04/11 16:18, Bill Roberts wrote:
> Andy
>
> Sorry for slow response from me.  I tried adding the line ((TERM ANY
> ANY) 10) to the end of the stats.opt file and that increased the
> speed of that query to about the same as with no optimiser.  So
> thanks very much for that - it's kind of solved the problem, but I
> need to do some more tests on a broader range of queries, to find the
> cases where the optimiser is actively helping (as opposed to no
> longer slowing things down!)

I'd be very interested in what you find out.  I recently realised why
the fixed optimizer does as well as it does.  It has a built-in stats
model that is data independent, so on the surface it looks like it should
sometimes make bad choices.  But the fixed-ness is only used to choose the
starting point, after which it tends to execute connected triple patterns
and so avoids intermediate cross products, which are really bad.
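
As a rough illustration of that idea (a minimal sketch, not TDB's actual
reordering code): choose the cheapest pattern as the starting point, then
keep picking patterns that share a variable with those already chosen.
The weight function (fewer variables is cheaper) and all names here are
assumptions made for the example.

```python
# Illustrative sketch only: not TDB's actual reordering code.
# A triple pattern is a (subject, predicate, object) tuple; strings starting
# with "?" are variables, anything else is a constant term.

def variables(pattern):
    """Return the set of variable names used in a triple pattern."""
    return {term for term in pattern if term.startswith("?")}


def fixed_weight(pattern):
    """Hypothetical data-independent weight: fewer variables means cheaper."""
    return len(variables(pattern))


def reorder(patterns, weight=fixed_weight):
    """Pick the cheapest pattern first, then prefer connected patterns."""
    remaining = list(patterns)
    ordered = []
    bound = set()  # variables bound by the patterns chosen so far
    while remaining:
        connected = [p for p in remaining if variables(p) & bound]
        # Only fall back to an unconnected pattern (a cross product)
        # when nothing shares a variable with what has been chosen.
        candidates = connected or remaining
        best = min(candidates, key=weight)
        remaining.remove(best)
        ordered.append(best)
        bound |= variables(best)
    return ordered


query = [
    ("?x", "<type>", "<C>"),
    ("?y", "<type>", "<D>"),
    ("?x", "<knows>", "?y"),
]
for pattern in reorder(query):
    print(pattern)
```

Sorting the same patterns by weight alone would put the two <type>
patterns next to each other, which joins them as a cross product; the
connected-first choice places the ?x/?y pattern between them.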

The stats file as generated by tdbstats or tdbloader2 is meant to be a
help, not a perfect stats file.

If you know <ifp> is an inverse functional property,

((ANY <ifp> TERM) 1)

is the right thing to add.
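
To make the rule shape concrete, here is a simplified model of how a rule
such as ((ANY <ifp> TERM) 1) could be matched against a triple pattern.
It assumes ANY matches anything, TERM matches any constant, VAR matches
any variable, and that the first matching rule wins; the function names
are made up for the sketch and the real matcher in TDB may behave
differently.

```python
# Simplified model of stats-rule matching; the real TDB matcher may differ.
# Assumption: rules are tried in order and the first match supplies the weight.

def is_var(term):
    """Variables are written with a leading "?" in this sketch."""
    return term.startswith("?")


def item_matches(item, term):
    """Match one rule item against one term of a triple pattern."""
    if item == "ANY":
        return True              # matches a variable or a constant
    if item == "VAR":
        return is_var(term)
    if item == "TERM":
        return not is_var(term)  # any constant RDF term
    return item == term          # a concrete term must match exactly


def weight_for(pattern, rules):
    """Return the weight of the first matching rule, or None if none match."""
    for rule_pattern, weight in rules:
        if all(item_matches(i, t) for i, t in zip(rule_pattern, pattern)):
            return weight
    return None


rules = [
    (("ANY", "<ifp>", "TERM"), 1),   # the inverse functional property rule
    (("TERM", "ANY", "ANY"), 10),    # the constant-subject rule from this thread
]

print(weight_for(("?s", "<ifp>", "<value>"), rules))
print(weight_for(("<a-specific-uri>", "?p", "?key"), rules))
print(weight_for(("?s", "?p", "?o"), rules))
```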

> Presumably it's cases where there are multiple clauses in the query
> and the order of evaluating them is significant.

Yes.

>
> Anyway, thanks a lot for your help.
>
> Cheers
>
> Bill

	Andy

Re: Problems with TDB Optimizer

Posted by Bill Roberts <bi...@swirrl.com>.

Andy

Sorry for slow response from me.  I tried adding the line ((TERM ANY ANY) 10) to the end of the stats.opt file and that increased the speed of that query to about the same as with no optimiser.  So thanks very much for that - it's kind of solved the problem, but I need to do some more tests on a broader range of queries, to find the cases where the optimiser is actively helping (as opposed to no longer slowing things down!)

Presumably it's cases where there are multiple clauses in the query and the order of evaluating them is significant.

Anyway, thanks a lot for your help.

Cheers

Bill


On 4 Apr 2011, at 15:47, Andy Seaborne wrote:

> Hi Bill,
> 
> The stats optimizer is preferring the property it knows about to the variable ?p, which it does not.  You can add a stats rule to the file to tell the optimizer about a constant-subject triple pattern, e.g. at the end of the stats file you sent me, add the line "((TERM ANY ANY) 10)":
> 
> ...
>  (<http://education.data.gov.uk/...> 24336)
>  ((TERM ANY ANY) 10)
>  (other 0))
> 
> 10 is a guess - lower numbers will increase the favour of the
> "<a-specific-uri> ?p ?key" part.
> 
> 
> Could you let me know if that changes things measurably on the real data?
> 
> Maybe this ought to always go in the stats file.  (That needs careful thought because if it's wrong, it's potentially a bit nasty.)
> 
> 	Andy
> 
> 
> 
> On 02/04/11 19:58, Andy Seaborne wrote:
>> Quick answer, longer one to follow:
>> 
>> Could you try using the "fixed.opt", removing "stats.opt" and let me
>> know what happens?
>> 
>> Andy

Re: Fwd: Problems with TDB Optimizer

Posted by Andy Seaborne <an...@epimorphics.com>.

Hi Bill,

The stats optimizer is preferring the property it knows about to the
variable ?p, which it does not.  You can add a stats rule to the file to
tell the optimizer about a constant-subject triple pattern, e.g. at the
end of the stats file you sent me, add the line "((TERM ANY ANY) 10)":

...
   (<http://education.data.gov.uk/...> 24336)
   ((TERM ANY ANY) 10)
   (other 0))

10 is a guess - lower numbers will increase the favour of the
"<a-specific-uri> ?p ?key" part.
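
A small model of why the extra rule changes the order, assuming that
patterns with no matching rule are executed after patterns that have an
estimate (the behaviour described above, simplified).  The count used for
rdfs:label is invented for the example, and the code is not taken from
TDB.

```python
# Hypothetical model of the behaviour described above, not TDB's real code.
RDFS_LABEL = "<http://www.w3.org/2000/01/rdf-schema#label>"
LABEL_COUNT = 1_000_000  # made-up count; the real stats file is not shown


def estimate(pattern, subject_rule_weight=None):
    """Estimated matches for a pattern, or None if no rule applies."""
    subject, predicate, _ = pattern
    if predicate == RDFS_LABEL:
        return LABEL_COUNT                # predicate listed in the stats
    if subject_rule_weight is not None and not subject.startswith("?"):
        return subject_rule_weight        # ((TERM ANY ANY) weight)
    return None


def order(patterns, subject_rule_weight=None):
    """Known estimates first (smallest first); unknown patterns go last."""
    def key(pattern):
        est = estimate(pattern, subject_rule_weight)
        return (est is None, est if est is not None else 0)
    return sorted(patterns, key=key)


query = [
    ("<a-specific-uri>", "?p", "?key"),
    ("?key", RDFS_LABEL, "?label"),
]

print(order(query)[0])                          # without the extra rule
print(order(query, subject_rule_weight=10)[0])  # with ((TERM ANY ANY) 10)
```

In this model the constant-subject pattern is chosen first only while its
weight stays below the count for the label predicate, which is why a
lower number favours it more.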


Could you let me know if that changes things measurably on the real data?

Maybe this ought to always go in the stats file.  (That needs careful
thought because if it's wrong, it's potentially a bit nasty.)

	Andy



On 02/04/11 19:58, Andy Seaborne wrote:
> Quick answer, longer one to follow:
>
> Could you try using the "fixed.opt", removing "stats.opt" and let me
> know what happens?
>
> Andy

Re: Fwd: Problems with TDB Optimizer

Posted by Andy Seaborne <an...@epimorphics.com>.

Quick answer, longer one to follow:

Could you try using the "fixed.opt", removing "stats.opt" and let me 
know what happens?

	Andy
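
For anyone wanting to script that switch, a minimal sketch follows.  It
assumes TDB picks its strategy from which .opt file is present in the
database directory and that fixed.opt may be empty; the function name and
backup file name are arbitrary, and the dataset presumably needs to be
reopened (for example by restarting Joseki) before the change takes
effect.

```python
import tempfile
from pathlib import Path


def use_fixed_optimizer(db_dir):
    """Set aside stats.opt and create an empty fixed.opt in db_dir."""
    db = Path(db_dir)
    stats = db / "stats.opt"
    if stats.exists():
        # Rename rather than delete so the stats can be restored later.
        stats.rename(db / "stats.opt.bak")   # backup name is arbitrary
    (db / "fixed.opt").touch()               # assumed: contents are not read


# Demonstration on a throwaway directory standing in for a TDB database.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "stats.opt").write_text("(stats)\n")
    use_fixed_optimizer(tmp)
    print(sorted(p.name for p in Path(tmp).iterdir()))
```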

