Posted to users@jena.apache.org by Andy Seaborne <an...@apache.org> on 2012/07/01 17:33:23 UTC
Re: Rebuilding TDB index and updating stats file
On 30/06/12 21:04, Sarven Capadisli wrote:
> On 2012-06-30 12:48, Andy Seaborne wrote:
>> On 29/06/12 02:49, Sarven Capadisli wrote:
>>> On 2012-06-28 20:25, Andy Seaborne wrote:
>>>> On 28/06/12 10:11, Sarven Capadisli wrote:
>>>>> I was wondering if there is a way to rebuild the TDB index from
>>>>> command-line and have it consequently update the stats file?
>>>>
>>>> There isn't a way to rebuild just one of the indexes from another in
>>>> the TDB distribution. Is that what you want to do?
>>>>
>>>> tdbstats calculates the stats.
>>>
>>> I want to optimize query response times.
>>>
>>> I can't get a satisfactory solution with tdbstats because it doesn't let
>>> me optimize for each named graph in the store.
>>
>> What sort of queries are you asking the store?
>
> For a store with 165 million triples, some real examples:
>
> SELECT DISTINCT ?o WHERE { ?s a ?o }
>
> Time: 159.359 sec (100 sec in second time round)
>
>
> SELECT DISTINCT ?o WHERE { GRAPH <http://worldbank.270a.info/graph/meta>
> { ?s a ?o } }
>
> Time: 0.394 sec
>
> SELECT DISTINCT ?o WHERE { GRAPH
> <http://worldbank.270a.info/graph/world-bank-finances> { ?s a ?o } }
>
> Time: 1.946 sec
>
> SELECT DISTINCT ?o WHERE { GRAPH
> <http://worldbank.270a.info/graph/world-bank-climates> { ?s a ?o } }
>
> Time: 46.967 sec
>
> SELECT DISTINCT ?o WHERE { GRAPH
> <http://worldbank.270a.info/graph/world-development-indicators> { ?s a
> ?o } }
>
> Time: 61.323 sec
>
> SELECT DISTINCT ?o WHERE { GRAPH
> <http://worldbank.270a.info/graph/world-bank-projects-and-operations> {
> ?s a ?o } }
>
> Time: 0.559 sec
>
> A quick note on this: when I run the query where the default graph is
> the union of all graphs, it takes much longer in total than the total
> time for queries with different named graphs.
>
> Other examples:
>
> SELECT DISTINCT ?p ?o WHERE { GRAPH <g> { ?s ?p ?o } } --time=10m
>
> SELECT DISTINCT ?g WHERE { GRAPH ?g { } } --time=60s (49s in second
> round, 55s on third..)

The low-level optimizer, stats or otherwise, reorders the triple patterns
within a basic graph pattern. In your example, there is only one triple
pattern, so there is no choice of ordering and the optimizer will make
no difference.
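
As a rough illustration of why a single pattern leaves the optimizer nothing to do, here is a toy reordering sketch in Python. It is not TDB's code: the `estimate` function, the `ex:` predicate and the match counts are made up for the example, and sorting by estimated match count is only a simplified stand-in for what a stats-driven reorderer does.

```python
from math import factorial

# Hypothetical match-count estimates, the kind of figure a stats file might supply.
ESTIMATES = {
    ("?s", "rdf:type", "?o"): 20_000_000,
    ("?s", "ex:country", "ex:Canada"): 1_500,
}


def estimate(pattern):
    """Look up the (made-up) estimated number of matches for a triple pattern."""
    return ESTIMATES[pattern]


def reorder(patterns, estimate_fn):
    """Toy reordering: run the pattern with the fewest estimated matches first."""
    return sorted(patterns, key=estimate_fn)


def possible_orderings(patterns):
    """Number of different execution orders for a basic graph pattern."""
    return factorial(len(patterns))


single = [("?s", "rdf:type", "?o")]
pair = [("?s", "rdf:type", "?o"), ("?s", "ex:country", "ex:Canada")]

print(possible_orderings(single), reorder(single, estimate))
print(possible_orderings(pair), reorder(pair, estimate))
```

With one pattern there is exactly one ordering, so any statistics are irrelevant; with two, the estimates decide which pattern runs first.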

SELECT DISTINCT ?o WHERE { ?s a ?o }

over the union default graph is an access to the POSG index: P comes first
because P = rdf:type is fixed. TDB uses 3 indexes for the (real)
default graph and 6 for named graphs, which means any access by G/S/P/O can
be served from an index, but not in every possible sort order (cf.
Hexastore, which has 6 indexes for the single graph). It would take 24 (=
4*3*2*1) indexes to cover all possibilities for named graphs.
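
The index argument can be checked with a small Python sketch. The index names listed are an assumption about a default TDB layout (only POSG is named in this message), and `pick_index` is a hypothetical helper, not a TDB function: it simply prefers the index whose leading columns cover the most bound positions.

```python
from itertools import combinations, permutations

# Assumed index names: three for the default graph, six for named graphs.
TRIPLE_INDEXES = ["SPO", "POS", "OSP"]
QUAD_INDEXES = ["GSPO", "GPOS", "GOSP", "SPOG", "POSG", "OSPG"]


def usable_prefix(index, bound):
    """Count the leading index columns that are bound in the pattern."""
    count = 0
    for column in index:
        if column not in bound:
            break
        count += 1
    return count


def pick_index(indexes, bound):
    """Choose the index whose leading columns cover the most bound positions."""
    return max(indexes, key=lambda index: usable_prefix(index, bound))


def all_patterns_covered(indexes, columns):
    """True if every non-empty set of bound columns is a prefix of some index."""
    for size in range(1, len(columns) + 1):
        for bound in combinations(columns, size):
            best = pick_index(indexes, set(bound))
            if usable_prefix(best, set(bound)) != size:
                return False
    return True


print(pick_index(QUAD_INDEXES, {"P"}))          # only the predicate is fixed
print(all_patterns_covered(QUAD_INDEXES, "GSPO"))
print(len(list(permutations("GSPO"))))          # every possible sort order
```

Under these assumptions, six quad indexes are enough to answer any combination of bound G/S/P/O by a prefix lookup, while all 24 permutations would be needed to also deliver every sort order.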

And when it is the union graph, the results have to be reduced to unique
triples, so { ?s a ?o } becomes what is effectively

SELECT DISTINCT ?s ?o { GRAPH ?g { ?s a ?o } }

Each triple pattern has to have the distinct-ness applied, so it puts
stress on memory as well. If it were cleverer, it would know it could
use a cheaper filter to calculate distinct-ness.
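
To make the memory point concrete, here is a Python sketch of two ways to reduce quads to unique triples. The message does not say what the cheaper filter would be, so `distinct_adjacent` is only one plausible reading: it assumes the quads arrive in POSG order, where copies of the same triple in different graphs sit next to each other. The sample data and function names are hypothetical.

```python
def distinct_with_set(quads):
    """General approach: remember every triple seen so far (memory grows with the result)."""
    seen = set()
    for _graph, s, p, o in quads:
        triple = (s, p, o)
        if triple not in seen:
            seen.add(triple)
            yield triple


def distinct_adjacent(quads_in_posg_order):
    """Cheaper filter: keep only the previous triple; valid only if duplicates are adjacent."""
    previous = None
    for _graph, s, p, o in quads_in_posg_order:
        triple = (s, p, o)
        if triple != previous:
            previous = triple
            yield triple


# Made-up quads as (graph, subject, predicate, object).
quads = [
    ("g1", "a", "rdf:type", "T1"),
    ("g2", "a", "rdf:type", "T1"),
    ("g1", "b", "rdf:type", "T2"),
    ("g2", "b", "rdf:type", "T1"),
]

# Mimic a POSG index scan by sorting on (P, O, S, G).
posg_order = sorted(quads, key=lambda q: (q[2], q[3], q[1], q[0]))

print(list(distinct_with_set(posg_order)))
print(list(distinct_adjacent(posg_order)))
```

Both functions yield the same triples on sorted input, but the second needs constant extra memory, whereas the first holds every distinct triple in memory at once.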

Also, the system isn't smart enough to notice that you have a DISTINCT
over an already unique expression and so does not need the outer DISTINCT.

Something similar happens for

SELECT DISTINCT ?g WHERE { GRAPH ?g { } }

The thing that will most help performance is RAM. How much RAM do you
have, and what sort of OS are you running?

Andy