Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2012/07/01 17:33:23 UTC
Re: Rebuilding TDB index and updating stats file

On 30/06/12 21:04, Sarven Capadisli wrote:
> On 2012-06-30 12:48, Andy Seaborne wrote:
>> On 29/06/12 02:49, Sarven Capadisli wrote:
>>> On 2012-06-28 20:25, Andy Seaborne wrote:
>>>> On 28/06/12 10:11, Sarven Capadisli wrote:
>>>>> I was wondering if there is a way to rebuild the TDB index from
>>>>> command-line and have it consequently update the stats file?
>>>>
>>>> There isn't a way to rebuild just one of the indexes from another in
>>>> the TDB distribution.  Is that what you want to do?
>>>>
>>>> tdbstats calculates the stats.
>>>
>>> I want to optimize query response times.
>>>
>>> I can't get a satisfactory solution with tdbstats because it doesn't let
>>> me optimize for each named graph in the store.
>>
>> What sort of queries are you asking the store?
>
> For a store with 165 million triples, some real examples:
>
> SELECT DISTINCT ?o WHERE { ?s a ?o }
>
> Time: 159.359 sec (100 sec in second time round)
>
>
> SELECT DISTINCT ?o WHERE { GRAPH <http://worldbank.270a.info/graph/meta>
> { ?s a ?o } }
>
> Time: 0.394 sec
>
> SELECT DISTINCT ?o WHERE { GRAPH
> <http://worldbank.270a.info/graph/world-bank-finances> { ?s a ?o } }
>
> Time: 1.946 sec
>
> SELECT DISTINCT ?o WHERE { GRAPH
> <http://worldbank.270a.info/graph/world-bank-climates> { ?s a ?o } }
>
> Time: 46.967 sec
>
> SELECT DISTINCT ?o WHERE { GRAPH
> <http://worldbank.270a.info/graph/world-development-indicators> { ?s a
> ?o } }
>
> Time: 61.323 sec
>
> SELECT DISTINCT ?o WHERE { GRAPH
> <http://worldbank.270a.info/graph/world-bank-projects-and-operations> {
> ?s a ?o } }
>
> Time: 0.559 sec
>
> A quick note on this: when I run the query where the default graph is
> the union of all graphs, it takes much longer than the combined time of
> the queries against the individual named graphs.
>
> Other examples:
>
> SELECT DISTINCT ?p ?o WHERE { GRAPH <g> { ?s ?p ?o } } --time=10m
>
> SELECT DISTINCT ?g WHERE { GRAPH ?g { } } --time=60s (49s in second
> round, 55s on third..)

The low-level optimizer, stats or otherwise, reorders the triple patterns
within a basic graph pattern.  In your examples, there is only one triple
pattern, so there is no choice of ordering and the optimizer will make
no difference.

SELECT DISTINCT ?o WHERE { ?s a ?o }

over the union default graph is an access to the POSG index: P comes first
because P = rdf:type is fixed.  TDB uses 3 indexes for the (real)
default graph and 6 for named graphs, which means any access pattern over
G/S/P/O can be answered from an index, but not in every possible sort
order (c.f. Hexastore, which has 6 indexes for a single graph).  It would
take 24 (= 4*3*2*1) indexes to cover all possible orderings for named
graphs.
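
As a rough illustration of the index choice (a sketch only, not TDB's
actual code), the following Python picks the index whose leading positions
cover the most fixed terms of a pattern.  The index names are the ones TDB
uses as far as I know; the function names are made up for this example.

```python
from itertools import combinations, permutations

# Index names as used by TDB (assumed here for illustration).
TRIPLE_INDEXES = ["SPO", "POS", "OSP"]
QUAD_INDEXES = ["GSPO", "GPOS", "GOSP", "SPOG", "POSG", "OSPG"]


def prefix_len(index, bound):
    """Number of leading index positions that are fixed in the pattern."""
    n = 0
    for position in index:
        if position not in bound:
            break
        n += 1
    return n


def choose_index(indexes, bound):
    """Pick the index whose leading positions cover the most fixed terms."""
    return max(indexes, key=lambda index: prefix_len(index, bound))


def all_patterns_covered(indexes):
    """True if every combination of fixed positions is a prefix of some index."""
    positions = indexes[0]
    for size in range(len(positions) + 1):
        for combo in combinations(positions, size):
            bound = set(combo)
            if prefix_len(choose_index(indexes, bound), bound) != size:
                return False
    return True


print(choose_index(QUAD_INDEXES, {"P"}))       # only P (rdf:type) is fixed
print(choose_index(QUAD_INDEXES, {"G", "P"}))  # graph and predicate fixed
print(all_patterns_covered(QUAD_INDEXES))      # every access pattern has an index
print(len(list(permutations("GSPO"))))         # number of possible sort orders
```

Under these assumptions the first call selects POSG, which is the access
described above for { ?s a ?o } over the union graph, and the last line
gives the 24 orderings that full coverage of sort orders would need.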

And when it is the union graph, the results have to be reduced to unique 
triples so { ?s a ?o } becomes what is effectively

SELECT DISTINCT ?s ?o { GRAPH ?g { ?s a ?o } }

Each triple pattern has to have the distinct-ness applied so it puts 
stress on memory as well.  If it were cleverer, it would know it could 
use a cheaper filter to calculate distinct-ness.

Also the system isn't smart enough to notice you have a DISTINCT of a 
unique expression and it does not need the outer DISTINCT.
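
To make the memory cost concrete, here is a small Python sketch of the two
layers of distinct-ness described above.  It is only a model of the
behaviour, not how TDB implements it; the quads, names and functions are
invented for the example.

```python
def union_graph_matches(quads, predicate):
    """Yield unique (s, o) pairs for ?s <predicate> ?o over all named graphs."""
    seen = set()  # has to remember every (s, o) pair already produced
    for _graph, s, p, o in quads:
        if p == predicate and (s, o) not in seen:
            seen.add((s, o))
            yield s, o


def distinct_objects(quads, predicate):
    """The outer SELECT DISTINCT ?o, applied on top of the union-graph view."""
    seen = set()
    for _s, o in union_graph_matches(quads, predicate):
        if o not in seen:
            seen.add(o)
            yield o


# Hypothetical data: the same triple appears in two named graphs.
QUADS = [
    ("g1", "s1", "rdf:type", "C1"),
    ("g2", "s1", "rdf:type", "C1"),
    ("g2", "s2", "rdf:type", "C1"),
    ("g1", "s2", "rdf:type", "C2"),
    ("g1", "s1", "rdfs:label", "x"),
]

print(list(union_graph_matches(QUADS, "rdf:type")))
print(list(distinct_objects(QUADS, "rdf:type")))
```

The inner set grows with the number of distinct (s, o) pairs in the whole
store, even though the final answer is only a handful of classes, which is
why the union-graph form of the query leans so heavily on memory.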

Something similar happens for

  SELECT DISTINCT ?g WHERE { GRAPH ?g { } }

The thing that will most help performance is RAM.  How much RAM and on 
what sort of OS are you running?

	Andy