Posted to users@jena.apache.org by Glenn Proctor <gl...@eaglegenomics.com> on 2012/03/06 10:14:03 UTC

Performance of successive identical queries

Hi

I have a TDB instance (0.8.10) containing about 207m triples. I've run
tdbstats and moved stats.opt into the appropriate place.

I've noticed that running the same query multiple times in succession
results in successively shorter query times, up to a point. For
example, on an otherwise-idle TDB instance, the query

SELECT ?facet ?val (COUNT(?val) as ?vc) WHERE { ?id a ?val . ?id
?facet ?val . } GROUP BY ?facet ?val ORDER BY DESC(?vc) LIMIT 25

takes 3707s, then 1424s, then 345s, where it seems to stay for subsequent runs.

What's the reason for this initial improvement and subsequent tailing
off - are the indexes being optimised with every query?

Glenn.

Re: indexing [was Re: Performance of successive identical queries]

Posted by Andy Seaborne <an...@apache.org>.
On 07/03/12 21:05, Andy Seaborne wrote:
> On 06/03/12 21:22, Rob Vesse wrote:
>> If I might throw my 2 cents into the mix...
>>
>> In dotNetRDF in the recent releases (2 weeks ago) we added the ability
>> to automatically have a dataset linked to a full text index and keep
>> that index in sync with changes in the dataset. My approach to this was
>> to use the decorator pattern, so what I have is a base decorator [1]
>> which is simply an implementation of our dataset interface which passes
>> through all calls to the underlying dataset. We then have a decorator
>> [2] which extends this base class and adds the logic to intercept the
>> calls that alter the dataset so that it updates the index as well as
>> passing the call through to the underlying dataset.
>>
>> Since all updates go through the dataset interface this allows us to
>> catch all updates and keep the full text index up to date. Whether this
>> is applicable to Jena or not depends on whether all updates go through a
>> single dataset interface in Jena which is a part of the code base I am
>> not so familiar with?
>>
>> Rob
>>
>> [1]
>> http://dotnetrdf.svn.sourceforge.net/viewvc/dotnetrdf/Trunk/Libraries/core/Query/Datasets/WrapperDataset.cs?revision=2157&view=markup
>>
>>
>> [2]
>> http://dotnetrdf.svn.sourceforge.net/viewvc/dotnetrdf/Trunk/Libraries/query.fulltext/Datasets/FullTextIndexedDataset.cs?revision=2157&view=markup
>>
>
> Other than the fact it's called *Wrapper, the decorator pattern is used
> in ARQ and TDB in various places.
>
> TDB used to support graphs without datasets, so it's a bit more mixed
> than just catching DatasetGraph. But that's history.
>
> We could change GraphTDBBase to use DatasetGraph and so everything goes
> via DatasetGraph .... or even junk GraphTDB altogether and have
> standardised graphs-over-datasetgraphs.
>
> SPARQL Query bypasses all this as does bulkloading.
>
> SPARQL Update does use DatasetGraph.
>
> see DatasetGraphWrapper
>
> Eclipse practically writes such classes if you implement an interface
> and use "quick fix".
>
> Andy
>

OK - it's not that simple :-)

GraphTDB also provides access to internal structures for the query 
engine etc.  It avoids the need to keep casting to get from "graph" to 
the tuple tables, which would be needed if it were generic and only gave 
access to the dataset.  I avoid designs with excessive casting because 
of the chance that something of the wrong type is passed in.

So GraphTDB sits alongside DatasetGraphTDB, and both need wrappers at 
the moment if you want to trap API update calls.

	Andy




Re: indexing [was Re: Performance of successive identical queries]

Posted by Andy Seaborne <an...@apache.org>.
On 06/03/12 21:22, Rob Vesse wrote:
> If I might throw my 2 cents into the mix...
>
> In dotNetRDF in the recent releases (2 weeks ago) we added the ability
> to automatically have a dataset linked to a full text index and keep
> that index in sync with changes in the dataset. My approach to this was
> to use the decorator pattern, so what I have is a base decorator [1]
> which is simply an implementation of our dataset interface which passes
> through all calls to the underlying dataset. We then have a decorator
> [2] which extends this base class and adds the logic to intercept the
> calls that alter the dataset so that it updates the index as well as
> passing the call through to the underlying dataset.
>
> Since all updates go through the dataset interface this allows us to
> catch all updates and keep the full text index up to date. Whether this
> is applicable to Jena or not depends on whether all updates go through a
> single dataset interface in Jena which is a part of the code base I am
> not so familiar with?
>
> Rob
>
> [1]
> http://dotnetrdf.svn.sourceforge.net/viewvc/dotnetrdf/Trunk/Libraries/core/Query/Datasets/WrapperDataset.cs?revision=2157&view=markup
>
> [2]
> http://dotnetrdf.svn.sourceforge.net/viewvc/dotnetrdf/Trunk/Libraries/query.fulltext/Datasets/FullTextIndexedDataset.cs?revision=2157&view=markup

Other than the fact it's called *Wrapper, the decorator pattern is used 
in ARQ and TDB in various places.

TDB used to support graphs without datasets, so it's a bit more mixed 
than just catching DatasetGraph.  But that's history.

We could change GraphTDBBase to use DatasetGraph and so everything goes 
via DatasetGraph .... or even junk GraphTDB altogether and have 
standardised graphs-over-datasetgraphs.

SPARQL Query bypasses all this as does bulkloading.

SPARQL Update does use DatasetGraph.

see DatasetGraphWrapper

Eclipse practically writes such classes if you implement an interface 
and use "quick fix".

	Andy


Re: indexing [was Re: Performance of successive identical queries]

Posted by Rob Vesse <ra...@ecs.soton.ac.uk>.
If I might throw my 2 cents into the mix...

In dotNetRDF in the recent releases (2 weeks ago) we added the ability 
to automatically have a dataset linked to a full text index and keep 
that index in sync with changes in the dataset.  My approach to this was 
to use the decorator pattern, so what I have is a base decorator [1] 
which is simply an implementation of our dataset interface which passes 
through all calls to the underlying dataset.  We then have a decorator 
[2] which extends this base class and adds the logic to intercept the 
calls that alter the dataset so that it updates the index as well as 
passing the call through to the underlying dataset.

Since all updates go through the dataset interface, this allows us to 
catch all updates and keep the full text index up to date.  Whether this 
is applicable to Jena or not depends on whether all updates go through a 
single dataset interface in Jena, which is a part of the code base I am 
not so familiar with.

Rob

[1] 
http://dotnetrdf.svn.sourceforge.net/viewvc/dotnetrdf/Trunk/Libraries/core/Query/Datasets/WrapperDataset.cs?revision=2157&view=markup
[2] 
http://dotnetrdf.svn.sourceforge.net/viewvc/dotnetrdf/Trunk/Libraries/query.fulltext/Datasets/FullTextIndexedDataset.cs?revision=2157&view=markup

On 3/6/12 10:08 AM, Paolo Castagna wrote:
> Hi Alexander,
> thank you for sharing with us details about data.ox.ac.uk and pointing
> me at https://github.com/oucs/humfrey (who needs documentation when you
> have the source code? ;-)).
>
> I have been thinking about pros/cons of having a custom/additional index
> coupled with a TDB dataset and keeping it up-to-date (and/or in sync with
> TDB).
>
> I see two approaches:
>
>   1. internal to Jena
>       - pros
>          - simplicity for users, it works out of the box
>          - ...
>       - cons
>          - it requires an internal notification sub-system (which we have,
>            but it does not covers all possible paths and it might impact
>            performances)
>          - it might create expectations that indexes will never go out of
>            sync (while it might happen)
>          - ...
>   2. external to Jena
>       - pros
>          - relatively easy to implement assuming SPARQL and external index
>            APIs
>          - ...
>       - cons
>          - it requires an additional service
>          - it isn't simple (or possible) with certain SPARQL Update requests
>          - ...
>
> Re: sanity/insanity, I don't comment.
>
> We have a system which intercept all the update requests, it puts into a
> sort of key-value store in S3 with a cache in front of it. We have a
> queuing/messaging system which applies changes to replicas on different
> nodes taking stuff from there. Nodes can be different types: RDF stores,
> free-text indexes, etc. In this scenario, update requests cannot be
> unconstrained SPARQL queries, but you can replay updates and apply them
> to different type of nodes/indexes. Some of the stuff is available here:
> https://github.com/talis/
>
> I imagine it is the case for you as well, it's not something you can
> just download, unzip and run as it is. This sort of simplicity is
> something IMHO not to underestimate and it is what drives me towards
> option 1. above.
>
> Knowing what others are doing it is certainly useful to better understand
> what's needed.
>
> Thanks,
> Paolo
>
> PS:
> I am not going to ask you what you do for: monitoring, backups,
> high-availability, load balancing, etc. ;-)
>
> Alexander Dutton wrote:
>> Hi Paolo,
>>
>> On 06/03/12 14:56, Paolo Castagna wrote:
>>> Alexander Dutton wrote:
>>>> This is the way we're going with our site, data.ox.ac.uk. After
>>>> each update to the triplestore we'll regenerate an ElasticSearch
>>>> index from a SPARQL query. […]
>>> interesting...
>>>
>>> How do you update your triplestore (SPARQL Update, Jena APIs via
>>> custom code, manually from command line, ...)?
>> Our administration interface manages grabbing data from elsewhere,
>> transforming it in various ways, and then uses the graph store HTTP
>> protocol to push it into Fuseki. Once that's done it fires off a
>> notification on a redis pubsub channel to say "this update just completed".
>>
>> There's then something that listens on the relevant channel which will
>> perform the ElasticSearch update. (There are other things that handle
>> uploading dataset metadata to thedatahub, and archiving datasets for
>> bulk download).
>>
>> There's code at https://github.com/oucs/humfrey, but it's a bit of a
>> nightmare to set up and (surprise, surprise) lacks documentation. The
>> ElasticSearch stuff is still in development on the elasticsearch branch.
>> At some point I'll find the time to make it easier to install and create
>> a demo site. (as you may have noticed, the whole thing is an eclectic
>> mix of technologies; Django, ElasticSearch, redis, PostgreSQL, Apache
>> httpd…)
>>
>>> We (still) have two related JIRA 'issues':
>>>
>>> - LARQ needs to update the Lucene index when a SPARQL Update request
>>> is received https://issues.apache.org/jira/browse/JENA-164
>>>
>>> - Refactor LARQ so that it becomes easy to plug in different indexes
>>> such as Solr or ElasticSearch instead of Lucene
>>> https://issues.apache.org/jira/browse/JENA-17
>>>
>>> I am still unclear how to intercept all the possible update routes
>>> (i.e. SPARQL Update, APIs, bulk loaders, etc...).
>> Our approach is to limit the ways in which updates can happen (i.e.
>> things will become inconsistent if it doesn't happen through our admin
>> interface). This obviously doesn't work in the general case, but could
>> be a useful half-way house (e.g. say "'INSERT … WHERE …' will leave you
>> with a stale index. If you care, use 'CONSTRUCT' and the graph store
>> protocol instead").
>>
>>> But, I think it would be useful to allow people to use Apache Solr
>>> and/or ElasticSearch indexes (and/or other custom indexes) and keep
>>> those up-to- date when changes come in.
>> For external indexes presumably you either need something that gets
>> hooked into the JVM and listens for updates there, or a way to push
>> notifications to external applications/services when things happen.
>>
>>> What do you store in ElasticSearch?
>> Technically, nothing yet, as I'm still implementing it ;-). Once it's
>> implemented it'll build indexes tailored to the types of modelling
>> patterns we expect to have in the store. For example, we might SPARQL
>> for organisations like<http://is.gd/gsc1Zs>  and for each create a chunk
>> of JSON to feed into ElasticSearch. Targets for indexing so far include
>> organisations, people, vacancies, courses, and equipment. We'll add more
>> indexes as we add new types of things.
>>
>>
>> All the best,
>>
>> Alex
>>
>> PS. I'd be interested to know whether our approach is generally
>> considered sane…


Re: (external) indexing [was Re: Performance of successive identical queries]

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi Alexander,
thank you for sharing with us details about data.ox.ac.uk and pointing
me at https://github.com/oucs/humfrey (who needs documentation when you
have the source code? ;-)).

I have been thinking about pros/cons of having a custom/additional index
coupled with a TDB dataset and keeping it up-to-date (and/or in sync with
TDB).

I see two approaches:

 1. internal to Jena
     - pros
        - simplicity for users, it works out of the box
        - ...
     - cons
        - it requires an internal notification sub-system (which we have,
          but it does not cover all possible paths and it might impact
          performance)
        - it might create expectations that indexes will never go out of
          sync (while that might happen)
        - ...
 2. external to Jena
     - pros
        - relatively easy to implement assuming SPARQL and external index
          APIs
        - ...
     - cons
        - it requires an additional service
        - it isn't simple (or possible) with certain SPARQL Update requests
        - ...

Re: sanity/insanity, I don't comment.

We have a system which intercepts all the update requests and puts them
into a sort of key-value store in S3 with a cache in front of it. We have
a queuing/messaging system which applies changes to replicas on different
nodes, taking its input from there. Nodes can be of different types: RDF
stores, free-text indexes, etc. In this scenario, update requests cannot
be unconstrained SPARQL queries, but you can replay updates and apply them
to different types of nodes/indexes. Some of the stuff is available here:
https://github.com/talis/

I imagine it is the case for you as well: it's not something you can
just download, unzip and run as it is. This sort of simplicity is
something IMHO not to be underestimated, and it is what drives me towards
option 1 above.

Knowing what others are doing is certainly useful for better understanding
what's needed.

Thanks,
Paolo

PS:
I am not going to ask you what you do for: monitoring, backups,
high-availability, load balancing, etc. ;-)

Alexander Dutton wrote:
> Hi Paolo,
> 
> On 06/03/12 14:56, Paolo Castagna wrote:
>> Alexander Dutton wrote:
>>> This is the way we're going with our site, data.ox.ac.uk. After
>>> each update to the triplestore we'll regenerate an ElasticSearch
>>> index from a SPARQL query. […]
>> interesting...
>>
>> How do you update your triplestore (SPARQL Update, Jena APIs via
>> custom code, manually from command line, ...)?
> 
> Our administration interface manages grabbing data from elsewhere,
> transforming it in various ways, and then uses the graph store HTTP
> protocol to push it into Fuseki. Once that's done it fires off a
> notification on a redis pubsub channel to say "this update just completed".
> 
> There's then something that listens on the relevant channel which will
> perform the ElasticSearch update. (There are other things that handle
> uploading dataset metadata to thedatahub, and archiving datasets for
> bulk download).
> 
> There's code at https://github.com/oucs/humfrey, but it's a bit of a
> nightmare to set up and (surprise, surprise) lacks documentation. The
> ElasticSearch stuff is still in development on the elasticsearch branch.
> At some point I'll find the time to make it easier to install and create
> a demo site. (as you may have noticed, the whole thing is an eclectic
> mix of technologies; Django, ElasticSearch, redis, PostgreSQL, Apache
> httpd…)
> 
>> We (still) have two related JIRA 'issues':
>>
>> - LARQ needs to update the Lucene index when a SPARQL Update request
>> is received https://issues.apache.org/jira/browse/JENA-164
>>
>> - Refactor LARQ so that it becomes easy to plug in different indexes
>> such as Solr or ElasticSearch instead of Lucene
>> https://issues.apache.org/jira/browse/JENA-17
>>
>> I am still unclear how to intercept all the possible update routes
>> (i.e. SPARQL Update, APIs, bulk loaders, etc...).
> 
> Our approach is to limit the ways in which updates can happen (i.e.
> things will become inconsistent if it doesn't happen through our admin
> interface). This obviously doesn't work in the general case, but could
> be a useful half-way house (e.g. say "'INSERT … WHERE …' will leave you
> with a stale index. If you care, use 'CONSTRUCT' and the graph store
> protocol instead").
> 
>> But, I think it would be useful to allow people to use Apache Solr
>> and/or ElasticSearch indexes (and/or other custom indexes) and keep
>> those up-to- date when changes come in.
> 
> For external indexes presumably you either need something that gets
> hooked into the JVM and listens for updates there, or a way to push
> notifications to external applications/services when things happen.
> 
>> What do you store in ElasticSearch?
> 
> Technically, nothing yet, as I'm still implementing it ;-). Once it's
> implemented it'll build indexes tailored to the types of modelling
> patterns we expect to have in the store. For example, we might SPARQL
> for organisations like <http://is.gd/gsc1Zs> and for each create a chunk
> of JSON to feed into ElasticSearch. Targets for indexing so far include
> organisations, people, vacancies, courses, and equipment. We'll add more
> indexes as we add new types of things.
> 
> 
> All the best,
> 
> Alex
> 
> PS. I'd be interested to know whether our approach is generally
> considered sane…

(external) indexing [was Re: Performance of successive identical queries]

Posted by Alexander Dutton <al...@oucs.ox.ac.uk>.
Hi Paolo,

On 06/03/12 14:56, Paolo Castagna wrote:
> Alexander Dutton wrote:
>> This is the way we're going with our site, data.ox.ac.uk. After
>> each update to the triplestore we'll regenerate an ElasticSearch
>> index from a SPARQL query. […]
>
> interesting...
>
> How do you update your triplestore (SPARQL Update, Jena APIs via
> custom code, manually from command line, ...)?

Our administration interface manages grabbing data from elsewhere,
transforming it in various ways, and then uses the graph store HTTP
protocol to push it into Fuseki. Once that's done it fires off a
notification on a redis pubsub channel to say "this update just completed".
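
For the curious, the shape of that step is roughly as follows (a sketch
rather than our actual code: the Fuseki endpoint, graph name and channel
are invented, and it assumes the Jedis client for redis):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;
    import java.net.URLEncoder;
    import redis.clients.jedis.Jedis;

    public class PushAndNotify {
        public static void main(String[] args) throws Exception {
            String graph = "http://data.example.org/graph/example";
            String endpoint = "http://localhost:3030/dataset/data?graph="
                              + URLEncoder.encode(graph, "UTF-8");
            byte[] turtle = "<http://example.org/s> <http://example.org/p> \"o\" ."
                              .getBytes("UTF-8");

            // Graph store HTTP protocol: PUT replaces the contents of the named graph.
            HttpURLConnection conn = (HttpURLConnection) new URL(endpoint).openConnection();
            conn.setRequestMethod("PUT");
            conn.setRequestProperty("Content-Type", "text/turtle");
            conn.setDoOutput(true);
            OutputStream out = conn.getOutputStream();
            out.write(turtle);
            out.close();
            System.out.println("Fuseki responded: " + conn.getResponseCode());

            // Tell listeners (e.g. the ElasticSearch updater) that the graph changed.
            Jedis jedis = new Jedis("localhost");
            jedis.publish("dataset.updated", graph);
            jedis.disconnect();
        }
    }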

There's then something that listens on the relevant channel which will
perform the ElasticSearch update. (There are other things that handle
uploading dataset metadata to thedatahub, and archiving datasets for
bulk download).

There's code at https://github.com/oucs/humfrey, but it's a bit of a
nightmare to set up and (surprise, surprise) lacks documentation. The
ElasticSearch stuff is still in development on the elasticsearch branch.
At some point I'll find the time to make it easier to install and create
a demo site. (as you may have noticed, the whole thing is an eclectic
mix of technologies; Django, ElasticSearch, redis, PostgreSQL, Apache
httpd…)

> We (still) have two related JIRA 'issues':
>
> - LARQ needs to update the Lucene index when a SPARQL Update request
> is received https://issues.apache.org/jira/browse/JENA-164
>
> - Refactor LARQ so that it becomes easy to plug in different indexes
> such as Solr or ElasticSearch instead of Lucene
> https://issues.apache.org/jira/browse/JENA-17
>
> I am still unclear how to intercept all the possible update routes
> (i.e. SPARQL Update, APIs, bulk loaders, etc...).

Our approach is to limit the ways in which updates can happen (i.e.
things will become inconsistent if it doesn't happen through our admin
interface). This obviously doesn't work in the general case, but could
be a useful half-way house (e.g. say "'INSERT … WHERE …' will leave you
with a stale index. If you care, use 'CONSTRUCT' and the graph store
protocol instead").

> But, I think it would be useful to allow people to use Apache Solr
> and/or ElasticSearch indexes (and/or other custom indexes) and keep
> those up-to- date when changes come in.

For external indexes presumably you either need something that gets
hooked into the JVM and listens for updates there, or a way to push
notifications to external applications/services when things happen.

> What do you store in ElasticSearch?

Technically, nothing yet, as I'm still implementing it ;-). Once it's
implemented it'll build indexes tailored to the types of modelling
patterns we expect to have in the store. For example, we might SPARQL
for organisations like <http://is.gd/gsc1Zs> and for each create a chunk
of JSON to feed into ElasticSearch. Targets for indexing so far include
organisations, people, vacancies, courses, and equipment. We'll add more
indexes as we add new types of things.
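
To make the idea a bit more concrete, a minimal sketch of that loop (the
query, endpoints, field names and document IDs are all placeholders, and
the JSON building is deliberately naive):

    import java.io.OutputStream;
    import java.net.HttpURLConnection;
    import java.net.URL;

    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.QuerySolution;
    import com.hp.hpl.jena.query.ResultSet;

    public class ReindexOrganisations {
        public static void main(String[] args) throws Exception {
            String sparql =
                "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#> " +
                "PREFIX foaf: <http://xmlns.com/foaf/0.1/> " +
                "SELECT ?org ?label WHERE { ?org a foaf:Organization ; rdfs:label ?label }";

            QueryExecution qe = QueryExecutionFactory.sparqlService(
                    "http://localhost:3030/dataset/query", sparql);
            ResultSet results = qe.execSelect();

            int id = 0;
            while (results.hasNext()) {
                QuerySolution row = results.next();
                // One small JSON document per organisation (no escaping; fine for a sketch).
                String json = "{\"uri\":\"" + row.getResource("org").getURI()
                            + "\",\"label\":\"" + row.getLiteral("label").getString() + "\"}";

                // ElasticSearch document API: PUT /<index>/<type>/<id>
                URL url = new URL("http://localhost:9200/data/organisation/" + (id++));
                HttpURLConnection conn = (HttpURLConnection) url.openConnection();
                conn.setRequestMethod("PUT");
                conn.setDoOutput(true);
                OutputStream out = conn.getOutputStream();
                out.write(json.getBytes("UTF-8"));
                out.close();
                conn.getResponseCode();   // actually send the request; ignore the response body
            }
            qe.close();
        }
    }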


All the best,

Alex

PS. I'd be interested to know whether our approach is generally
considered sane…

Re: Performance of successive identical queries

Posted by Paolo Castagna <ca...@googlemail.com>.
Alexander Dutton wrote:
> Hi Glenn,
> 
> On 06/03/12 09:57, Glenn Proctor wrote:
>> On Tue, Mar 6, 2012 at 9:49 AM, Paolo Castagna wrote:
>>
>>> A completely different alternative would be to use something such as
>>> Apache Solr or ElasticSearch along side your TDB store, they both
>>> support facet searches (and they can be quite fast): […]
>> Thanks Paulo - caching is indeed the next thing on the list to look at
>> :) I hadn't considered the Solr approach, but will do so.
> 
> This is the way we're going with our site, data.ox.ac.uk. After each
> update to the triplestore we'll regenerate an ElasticSearch index from a
> SPARQL query. It won't quite be real time, but should be good enough for
> our purposes. From the small amount of playing around with ElasticSearch
> I've done, it seems to 'Just Work'.

Hi Alex,
interesting...

How do you update your triplestore (SPARQL Update, Jena APIs via custom
code, manually from command line, ...)?

We (still) have two related JIRA 'issues':

 - LARQ needs to update the Lucene index when a SPARQL Update request
   is received
   https://issues.apache.org/jira/browse/JENA-164

 - Refactor LARQ so that it becomes easy to plug in different indexes such
   as Solr or ElasticSearch instead of Lucene
   https://issues.apache.org/jira/browse/JENA-17

I am still unclear how to intercept all the possible update routes (i.e.
SPARQL Update, APIs, bulk loaders, etc...).

But, I think it would be useful to allow people to use Apache Solr and/or
ElasticSearch indexes (and/or other custom indexes) and keep those up to
date when changes come in.

What do you store in ElasticSearch?

Paolo

> 
> I learnt about it from these people:
> http://cottagelabs.com/indexing-elastic-search-workshop-from-dev8d/.
> They've also got some JavaScript for faceted interfaces in
> https://github.com/CottageLabs/edjo and
> https://github.com/okfn/bibserver (the latter of which powers
> http://bibsoup.net/kcoyle/publications_karen_coyle). I'm not sure how
> repurposable it is…
> 
> All the best,
> 
> Alex


Re: Performance of successive identical queries

Posted by Alexander Dutton <al...@oucs.ox.ac.uk>.
Hi Glenn,

On 06/03/12 09:57, Glenn Proctor wrote:
> On Tue, Mar 6, 2012 at 9:49 AM, Paolo Castagna wrote:
>
>> A completely different alternative would be to use something such as
>> Apache Solr or ElasticSearch along side your TDB store, they both
>> support facet searches (and they can be quite fast): […]
> Thanks Paulo - caching is indeed the next thing on the list to look at
> :) I hadn't considered the Solr approach, but will do so.

This is the way we're going with our site, data.ox.ac.uk. After each
update to the triplestore we'll regenerate an ElasticSearch index from a
SPARQL query. It won't quite be real time, but should be good enough for
our purposes. From the small amount of playing around with ElasticSearch
I've done, it seems to 'Just Work'.

I learnt about it from these people:
http://cottagelabs.com/indexing-elastic-search-workshop-from-dev8d/.
They've also got some JavaScript for faceted interfaces in
https://github.com/CottageLabs/edjo and
https://github.com/okfn/bibserver (the latter of which powers
http://bibsoup.net/kcoyle/publications_karen_coyle). I'm not sure how
repurposable it is…

All the best,

Alex

Re: Performance of successive identical queries

Posted by Glenn Proctor <gl...@eaglegenomics.com>.
On Tue, Mar 6, 2012 at 9:49 AM, Paolo Castagna
<ca...@googlemail.com> wrote:

> I do not know your use cases and, in particular, I do not know if you
> are trying to provide a faceted navigation UI on top of your TDB store.
> But, from your query that seems the case.
>
> If those times are seconds, that is not going to provide a good user
> experience to your users. ;-)
>
> I do not know if your store is mostly read-only with just a few, non
> frequent and small updates, but if that is the case, you should really
> consider putting a caching layer in front of your TDB store.
> An experimental prototype Andy wrote is here:
>
>  - https://github.com/afs/LD-Access
>
> A completely different alternative would be to use something such as
> Apache Solr or ElasticSearch along side your TDB store, they both
> support facet searches (and they can be quite fast):
>
>  - http://wiki.apache.org/solr/SimpleFacetParameters
>  - http://www.elasticsearch.org/guide/reference/api/search/facets/
>
> None of these options are something you get out-of-the-box though,
> some work and development is involved.

Thanks Paolo - caching is indeed the next thing on the list to look at
:) I hadn't considered the Solr approach, but will do so.

Glenn.

Re: Performance of successive identical queries

Posted by Paolo Castagna <ca...@googlemail.com>.
Glenn Proctor wrote:
> Hi
> 
> I have a TDB instance (0.8.10) containing about 207m triples. I've run
> tdbstats and moved stats.opt into the appropriate place.
> 
> I've noticed that running the same query multiple times in succession
> results in successively shorter query times, up to a point. For
> example, on an otherwise-idle TDB instance, the query
> 
> SELECT ?facet ?val (COUNT(?val) as ?vc) WHERE { ?id a ?val . ?id
> ?facet ?val . } GROUP BY ?facet ?val ORDER BY DESC(?vc) LIMIT 25
> 
> Takes 3707s, then 1424s, then 345s where it seems to stay for subsequent runs.

Hi Glenn,
I do not know your use cases and, in particular, I do not know if you
are trying to provide a faceted navigation UI on top of your TDB store.
But, from your query, that seems to be the case.

If those times are seconds, that is not going to provide a good user
experience to your users. ;-)

I do not know if your store is mostly read-only with just a few,
infrequent and small updates, but if that is the case, you should really
consider putting a caching layer in front of your TDB store.
An experimental prototype Andy wrote is here:

 - https://github.com/afs/LD-Access

A completely different alternative would be to use something such as
Apache Solr or ElasticSearch alongside your TDB store; they both
support faceted search (and they can be quite fast):

 - http://wiki.apache.org/solr/SimpleFacetParameters
 - http://www.elasticsearch.org/guide/reference/api/search/facets/
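
For example, a Solr faceted request of the kind described on the
SimpleFacetParameters page looks roughly like this (the 'type' field is
made up; you would first have to flatten the relevant triples into Solr
documents):

  http://localhost:8983/solr/select?q=*:*&facet=true&facet.field=type&facet.limit=25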

None of these options is something you get out of the box, though;
some work and development is involved.

My 2 cents,
Paolo

> 
> What's the reason for this initial improvement and subsequent tailing
> off - are the indexes being optimised with every query?
> 
> Glenn.


Re: Performance of successive identical queries

Posted by Glenn Proctor <gl...@eaglegenomics.com>.
Hi Andy

Thanks for the clarification, it certainly makes sense.

Glenn.


On Tue, Mar 6, 2012 at 9:32 AM, Andy Seaborne <an...@apache.org> wrote:
> On 06/03/12 09:14, Glenn Proctor wrote:
>>
>> Hi
>>
>> I have a TDB instance (0.8.10) containing about 207m triples. I've run
>> tdbstats and moved stats.opt into the appropriate place.
>>
>> I've noticed that running the same query multiple times in succession
>> results in successively shorter query times, up to a point. For
>> example, on an otherwise-idle TDB instance, the query
>>
>> SELECT ?facet ?val (COUNT(?val) as ?vc) WHERE { ?id a ?val . ?id
>> ?facet ?val . } GROUP BY ?facet ?val ORDER BY DESC(?vc) LIMIT 25
>>
>> Takes 3707s, then 1424s, then 345s where it seems to stay for subsequent
>> runs.
>>
>> What's the reason for this initial improvement and subsequent tailing
>> off - are the indexes being optimised with every query?
>>
>> Glenn.
>
>
> Glenn,
>
> Nothing so clever I'm afraid. I think what your seeing is the OS management
> of memory mapped files.
>
> The first run, if a cold system or if queries that have touched different
> parts of indexes, will cause the memory mapped pages to become mapped and
> this is also caching index data in memory.  The latter runs benefit from the
> OS caching.  If the intermediate results are large for the sort, then it's
> spilling to disk, also with possible OS cache effects.
>
>        Andy

Re: Performance of successive identical queries

Posted by Andy Seaborne <an...@apache.org>.
On 06/03/12 09:14, Glenn Proctor wrote:
> Hi
>
> I have a TDB instance (0.8.10) containing about 207m triples. I've run
> tdbstats and moved stats.opt into the appropriate place.
>
> I've noticed that running the same query multiple times in succession
> results in successively shorter query times, up to a point. For
> example, on an otherwise-idle TDB instance, the query
>
> SELECT ?facet ?val (COUNT(?val) as ?vc) WHERE { ?id a ?val . ?id
> ?facet ?val . } GROUP BY ?facet ?val ORDER BY DESC(?vc) LIMIT 25
>
> Takes 3707s, then 1424s, then 345s where it seems to stay for subsequent runs.
>
> What's the reason for this initial improvement and subsequent tailing
> off - are the indexes being optimised with every query?
>
> Glenn.

Glenn,

Nothing so clever, I'm afraid. I think what you're seeing is the OS 
management of memory-mapped files.

The first run, on a cold system or after queries that have touched 
different parts of the indexes, causes the memory-mapped pages to be 
mapped in, which also caches index data in memory.  The later runs 
benefit from the OS caching.  If the intermediate results for the sort 
are large, then it is spilling to disk, also with possible OS cache effects.
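
If anyone wants to watch the warm-up effect in isolation, something like
the following rough sketch (the store location is an example; adjust to
taste) runs the same query a few times against a TDB dataset and prints
the elapsed time per run:

    import com.hp.hpl.jena.query.Dataset;
    import com.hp.hpl.jena.query.QueryExecution;
    import com.hp.hpl.jena.query.QueryExecutionFactory;
    import com.hp.hpl.jena.query.ResultSetFormatter;
    import com.hp.hpl.jena.tdb.TDBFactory;

    public class WarmUpTimer {
        public static void main(String[] args) {
            Dataset dataset = TDBFactory.createDataset("/data/tdb");   // example location
            String sparql = "SELECT ?facet ?val (COUNT(?val) AS ?vc) "
                          + "WHERE { ?id a ?val . ?id ?facet ?val . } "
                          + "GROUP BY ?facet ?val ORDER BY DESC(?vc) LIMIT 25";

            for (int run = 1; run <= 3; run++) {
                long start = System.currentTimeMillis();
                QueryExecution qe = QueryExecutionFactory.create(sparql, dataset);
                ResultSetFormatter.consume(qe.execSelect());   // pull every result row
                qe.close();
                System.out.println("run " + run + ": "
                        + (System.currentTimeMillis() - start) / 1000 + "s");
            }
        }
    }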

	Andy