You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Tristan Roddis <tr...@cogapp.com> on 2013/08/05 15:47:34 UTC

Poor performance with OFFSET

We have a data set with around 160,000 triples corresponding to an 
"mmx:Object", and I am executing the following query to select data from 
these at various offsets:

SELECT DISTINCT *  WHERE {
?object a mmx:Object .
?object mmx:accessionNumber ?accessionNumber .
?object  rdfs:label ?label .
OPTIONAL {
?object mmx:preferredAsset ?asset .
?asset a mmx:Asset .
?asset dcterms:hasVersion ?image .
?image mmx:role 'thumbnail' .
}
} LIMIT 12 OFFSET 10000

However, what I have found is that as I increase the value of this 
offset, the execution time goes up dramatically: from tens of 
milliseconds when the offset is around 1000, to over 4,000 milliseconds 
if the offset is increased to 150,000. (see sample data below, or a 
graph at http://picpaste.com/jena_offsets.png )

I tried generating a stats.opt file, and also using none.opt but this 
behaviour seems to happen regardless: the execution time becomes 
extremely high very quickly, and then plateaus at offsets greater than 
100,000.

Does anyone know any way to get better performance from Jena for offset 
queries such as this?

Thanks in advance,

-Tristan.

Sample data below. Results obtained using Apache Benchmark on another 
machine on the same network segment. Dev server is a Linux machine with 
3Gb total RAM, of which 2Gb allocated to the JVM. Production server is a 
quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to the JVM.

Offset 	Dev (ms)
	Prod (ms)
10 	8 	11
100 	15 	17
1000 	84 	90
5000 	260 	460
10000 	463 	701
20000 	811 	1070
30000 	1402 	1502
40000 	1844 	2122
50000 	2515 	2695
60000 	2960 	3290
80000 	3686 	4190
100000 	3840 	4470
120000 	3991 	4416
150000 	3865 	4535


-- 
cogapp

Tristan Roddis
Head of Web Development
+44 1273 821 600
www.cogapp.com <http://www.cogapp.com>


*News*
EnoughFoodIf.org <http://enoughfoodif.org>
Digital programme led by a core group of UK charities, design by Lavish, 
technical architecture and implementation by Cogapp.
For regular Cogapp news, please follow us on Twitter 
<https://twitter.com/cogapp>

Re: Poor performance with OFFSET

Posted by Andy Seaborne <an...@apache.org>.

There is even a fast front end cache for SPARQL

https://svn.apache.org/repos/asf/jena/Experimental/sparql-cache/

that executes the query once then uses the precomputed results for 
LIMIT/OFFSET.  It preserves the order so you get pseudo-paging.

But this code is not part of the release and will need updating.

	Andy


On 05/08/13 21:02, Olivier Rossel wrote:
> Please note that such a query does not sort mmx:Objects in the inner select.
> I have no idea of the performances drop if you add that requirement.
> BTW, is it possible to create an index on (for example) "rdfs:label of
> mmx:Objects"?
>
>
> On Mon, Aug 5, 2013 at 6:40 PM, Rob Vesse <rv...@yarcdata.com> wrote:
>
>> Note that OFFSET is implemented internally by having to iterate over all
>> the unnecessary solutions first.
>>
>> So if you have OFFSET 10000 then the query engine has to generate and
>> discard 10,000 possible solutions before it starts returning actual
>> solutions.  As OFFSET increases the time this takes increases.
>>
>> There is some degree of optimization for these kinds of queries in ARQs
>> optimizer but it doesn't apply in all cases.
>>
>> Rob
>>
>>
>>
>> On 8/5/13 9:21 AM, "Tristan Roddis" <tr...@cogapp.com> wrote:
>>
>>> Thanks, Olivier: that did the trick!
>>>
>>> I tweaked it slightly as not all objects have a rdfs:label populated,
>>> and I need to keep that restriction. But using this query the maximum
>>> execution time drops to around 700ms:
>>>
>>> SELECT DISTINCT *  WHERE {
>>> { SELECT ?object ?label ?accessionNumber WHERE {?object a mmx:Object .
>>> ?object  rdfs:label ?label . ?object mmx:accessionNumber
>>> ?accessionNumber } LIMIT 12 OFFSET 20000 } .
>>> OPTIONAL {
>>> ?object mmx:preferredAsset ?asset .
>>> ?asset a mmx:Asset .
>>> ?asset dcterms:hasVersion ?image .
>>> ?image mmx:role 'thumbnail' .
>>> }
>>> }
>>>
>>> I also now realise that the lack of rdfs:label (and hence results) was
>>> the reason for the plateau. If I use your query syntax instead, I no
>>> longer see a plateau at 4000s for invalidly high offsets, but rather a
>>> speedy (zero result count) response in around 100ms.
>>>
>>> Thanks again,
>>>
>>> -Tristan.
>>>
>>> On 05/08/2013 15:00, Olivier Rossel wrote:
>>>> Could you please try a query like this:
>>>> SELECT DISTINCT *  WHERE {
>>>> { SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET 10000 } .
>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>> ?object  rdfs:label ?label .
>>>> OPTIONAL {
>>>> ?object mmx:preferredAsset ?asset .
>>>> ?asset a mmx:Asset .
>>>> ?asset dcterms:hasVersion ?image .
>>>> ?image mmx:role 'thumbnail' .
>>>> }
>>>> }
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <tr...@cogapp.com>
>>>> wrote:
>>>>
>>>>> We have a data set with around 160,000 triples corresponding to an
>>>>> "mmx:Object", and I am executing the following query to select data
>>>>> from
>>>>> these at various offsets:
>>>>>
>>>>> SELECT DISTINCT *  WHERE {
>>>>> ?object a mmx:Object .
>>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>>> ?object  rdfs:label ?label .
>>>>> OPTIONAL {
>>>>> ?object mmx:preferredAsset ?asset .
>>>>> ?asset a mmx:Asset .
>>>>> ?asset dcterms:hasVersion ?image .
>>>>> ?image mmx:role 'thumbnail' .
>>>>> }
>>>>> } LIMIT 12 OFFSET 10000
>>>>>
>>>>> However, what I have found is that as I increase the value of this
>>>>> offset,
>>>>> the execution time goes up dramatically: from tens of milliseconds
>>>>> when the
>>>>> offset is around 1000, to over 4,000 milliseconds if the offset is
>>>>> increased to 150,000. (see sample data below, or a graph at
>>>>>
>>>>> http://picpaste.com/jena_**offsets.png<http://picpaste.com/jena_offsets
>> .
>>>>> png>)
>>>>>
>>>>> I tried generating a stats.opt file, and also using none.opt but this
>>>>> behaviour seems to happen regardless: the execution time becomes
>>>>> extremely
>>>>> high very quickly, and then plateaus at offsets greater than 100,000.
>>>>>
>>>>> Does anyone know any way to get better performance from Jena for offset
>>>>> queries such as this?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> -Tristan.
>>>>>
>>>>> Sample data below. Results obtained using Apache Benchmark on another
>>>>> machine on the same network segment. Dev server is a Linux machine
>>>>> with 3Gb
>>>>> total RAM, of which 2Gb allocated to the JVM. Production server is a
>>>>> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to the
>>>>> JVM.
>>>>>
>>>>> Offset  Dev (ms)
>>>>>           Prod (ms)
>>>>> 10      8       11
>>>>> 100     15      17
>>>>> 1000    84      90
>>>>> 5000    260     460
>>>>> 10000   463     701
>>>>> 20000   811     1070
>>>>> 30000   1402    1502
>>>>> 40000   1844    2122
>>>>> 50000   2515    2695
>>>>> 60000   2960    3290
>>>>> 80000   3686    4190
>>>>> 100000  3840    4470
>>>>> 120000  3991    4416
>>>>> 150000  3865    4535
>>>>>
>>>>>
>>>>> --
>>>>> cogapp
>>>>>
>>>>> Tristan Roddis
>>>>> Head of Web Development
>>>>> +44 1273 821 600
>>>>> www.cogapp.com <http://www.cogapp.com>
>>>>>
>>>>>
>>>>> *News*
>>>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>>>> Digital programme led by a core group of UK charities, design by
>>>>> Lavish,
>>>>> technical architecture and implementation by Cogapp.
>>>>> For regular Cogapp news, please follow us on Twitter <
>>>>> https://twitter.com/cogapp>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>>
>>> --
>>> cogapp
>>>
>>> Tristan Roddis
>>> Head of Web Development
>>> +44 1273 821 600
>>> www.cogapp.com <http://www.cogapp.com>
>>>
>>>
>>> *News*
>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>> Digital programme led by a core group of UK charities, design by Lavish,
>>> technical architecture and implementation by Cogapp.
>>> For regular Cogapp news, please follow us on Twitter
>>> <https://twitter.com/cogapp>
>>>
>>>
>>>
>>>
>>
>>
>

Re: Poor performance with OFFSET

Posted by Rob Vesse <rv...@yarcdata.com>.

No you can't create a index that pre-computes the sorting of some
particular facet of your data.

Btw if you are using TDB then there shouldn't be any need to use an ORDER
BY to get paging since TDB will for any given query run multiple times
always return the data in a predictable fashion due to how it does index
scans and reads.  Obviously if you want a specific ordering then you need
to add an ORDER BY

Rob



On 8/6/13 8:10 AM, "Olivier Rossel" <ol...@gmail.com> wrote:

>Using LIMIT/OFFSET sounds like paging, to me.
>And a usual requirement of paging is to order data before paging.
>Ordering can be for example "order alphabetically mmx:Objects by their
>label".
>So the question is: can I create an index that precomputes "the sorting of
>labels of mmx:Objects".
>
>Andy s answer is pretty interesting (and more general).
>Tristan, the link you provide sounds interesting too. But I am not sure
>how
>it helps for the creation of the index I described above.
>
>
>
>
>On Tue, Aug 6, 2013 at 4:59 PM, Tristan Roddis <tr...@cogapp.com>
>wrote:
>
>> On 05/08/2013 21:02, Olivier Rossel wrote:
>>
>>> Please note that such a query does not sort mmx:Objects in the inner
>>> select.
>>> I have no idea of the performances drop if you add that requirement.
>>>
>> Using my data: quite a lot. For a sample offset of 50000, it increases
>> from 380ms without, to 1600ms with ordering.
>>
>>  BTW, is it possible to create an index on (for example) "rdfs:label of
>>> mmx:Objects"?
>>>
>> I'm pretty new to Jena and I'm afraid I'm not sure what you mean! Do you
>> mean adding a full-text search index as described at
>> 
>>http://jena.apache.org/**documentation/query/text-**query.html<http://jen
>>a.apache.org/documentation/query/text-query.html>or something else?
>>
>> Thanks,
>>
>> -Tristan.
>>
>>
>>  On Mon, Aug 5, 2013 at 6:40 PM, Rob Vesse <rv...@yarcdata.com> wrote:
>>
>>> Note that OFFSET is implemented internally by having to iterate over
>>>all
>>>> the unnecessary solutions first.
>>>>
>>>> So if you have OFFSET 10000 then the query engine has to generate and
>>>> discard 10,000 possible solutions before it starts returning actual
>>>> solutions.  As OFFSET increases the time this takes increases.
>>>>
>>>> There is some degree of optimization for these kinds of queries in
>>>>ARQs
>>>> optimizer but it doesn't apply in all cases.
>>>>
>>>> Rob
>>>>
>>>>
>>>>
>>>> On 8/5/13 9:21 AM, "Tristan Roddis" <tr...@cogapp.com> wrote:
>>>>
>>>>  Thanks, Olivier: that did the trick!
>>>>>
>>>>> I tweaked it slightly as not all objects have a rdfs:label populated,
>>>>> and I need to keep that restriction. But using this query the maximum
>>>>> execution time drops to around 700ms:
>>>>>
>>>>> SELECT DISTINCT *  WHERE {
>>>>> { SELECT ?object ?label ?accessionNumber WHERE {?object a mmx:Object
>>>>>.
>>>>> ?object  rdfs:label ?label . ?object mmx:accessionNumber
>>>>> ?accessionNumber } LIMIT 12 OFFSET 20000 } .
>>>>> OPTIONAL {
>>>>> ?object mmx:preferredAsset ?asset .
>>>>> ?asset a mmx:Asset .
>>>>> ?asset dcterms:hasVersion ?image .
>>>>> ?image mmx:role 'thumbnail' .
>>>>> }
>>>>> }
>>>>>
>>>>> I also now realise that the lack of rdfs:label (and hence results)
>>>>>was
>>>>> the reason for the plateau. If I use your query syntax instead, I no
>>>>> longer see a plateau at 4000s for invalidly high offsets, but rather
>>>>>a
>>>>> speedy (zero result count) response in around 100ms.
>>>>>
>>>>> Thanks again,
>>>>>
>>>>> -Tristan.
>>>>>
>>>>> On 05/08/2013 15:00, Olivier Rossel wrote:
>>>>>
>>>>>> Could you please try a query like this:
>>>>>> SELECT DISTINCT *  WHERE {
>>>>>> { SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET
>>>>>>10000 }
>>>>>> .
>>>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>>>> ?object  rdfs:label ?label .
>>>>>> OPTIONAL {
>>>>>> ?object mmx:preferredAsset ?asset .
>>>>>> ?asset a mmx:Asset .
>>>>>> ?asset dcterms:hasVersion ?image .
>>>>>> ?image mmx:role 'thumbnail' .
>>>>>> }
>>>>>> }
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <tr...@cogapp.com>
>>>>>> wrote:
>>>>>>
>>>>>>  We have a data set with around 160,000 triples corresponding to an
>>>>>>> "mmx:Object", and I am executing the following query to select data
>>>>>>> from
>>>>>>> these at various offsets:
>>>>>>>
>>>>>>> SELECT DISTINCT *  WHERE {
>>>>>>> ?object a mmx:Object .
>>>>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>>>>> ?object  rdfs:label ?label .
>>>>>>> OPTIONAL {
>>>>>>> ?object mmx:preferredAsset ?asset .
>>>>>>> ?asset a mmx:Asset .
>>>>>>> ?asset dcterms:hasVersion ?image .
>>>>>>> ?image mmx:role 'thumbnail' .
>>>>>>> }
>>>>>>> } LIMIT 12 OFFSET 10000
>>>>>>>
>>>>>>> However, what I have found is that as I increase the value of this
>>>>>>> offset,
>>>>>>> the execution time goes up dramatically: from tens of milliseconds
>>>>>>> when the
>>>>>>> offset is around 1000, to over 4,000 milliseconds if the offset is
>>>>>>> increased to 150,000. (see sample data below, or a graph at
>>>>>>>
>>>>>>> 
>>>>>>>http://picpaste.com/jena_****offsets.png<http://picpaste.com/jena_**
>>>>>>>offsets.png>
>>>>>>> 
>>>>>>><http://picpaste.**com/jena_offsets<http://picpaste.com/jena_offsets
>>>>>>>>
>>>>>>>
>>>>>> .
>>>>
>>>>> png>)
>>>>>>>
>>>>>>> I tried generating a stats.opt file, and also using none.opt but
>>>>>>>this
>>>>>>> behaviour seems to happen regardless: the execution time becomes
>>>>>>> extremely
>>>>>>> high very quickly, and then plateaus at offsets greater than
>>>>>>>100,000.
>>>>>>>
>>>>>>> Does anyone know any way to get better performance from Jena for
>>>>>>> offset
>>>>>>> queries such as this?
>>>>>>>
>>>>>>> Thanks in advance,
>>>>>>>
>>>>>>> -Tristan.
>>>>>>>
>>>>>>> Sample data below. Results obtained using Apache Benchmark on
>>>>>>>another
>>>>>>> machine on the same network segment. Dev server is a Linux machine
>>>>>>> with 3Gb
>>>>>>> total RAM, of which 2Gb allocated to the JVM. Production server is
>>>>>>>a
>>>>>>> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to
>>>>>>>the
>>>>>>> JVM.
>>>>>>>
>>>>>>> Offset  Dev (ms)
>>>>>>>           Prod (ms)
>>>>>>> 10      8       11
>>>>>>> 100     15      17
>>>>>>> 1000    84      90
>>>>>>> 5000    260     460
>>>>>>> 10000   463     701
>>>>>>> 20000   811     1070
>>>>>>> 30000   1402    1502
>>>>>>> 40000   1844    2122
>>>>>>> 50000   2515    2695
>>>>>>> 60000   2960    3290
>>>>>>> 80000   3686    4190
>>>>>>> 100000  3840    4470
>>>>>>> 120000  3991    4416
>>>>>>> 150000  3865    4535
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> cogapp
>>>>>>>
>>>>>>> Tristan Roddis
>>>>>>> Head of Web Development
>>>>>>> +44 1273 821 600
>>>>>>> www.cogapp.com <http://www.cogapp.com>
>>>>>>>
>>>>>>>
>>>>>>> *News*
>>>>>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>>>>>> Digital programme led by a core group of UK charities, design by
>>>>>>> Lavish,
>>>>>>> technical architecture and implementation by Cogapp.
>>>>>>> For regular Cogapp news, please follow us on Twitter <
>>>>>>> https://twitter.com/cogapp>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>> --
>>>>> cogapp
>>>>>
>>>>> Tristan Roddis
>>>>> Head of Web Development
>>>>> +44 1273 821 600
>>>>> www.cogapp.com <http://www.cogapp.com>
>>>>>
>>>>>
>>>>> *News*
>>>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>>>> Digital programme led by a core group of UK charities, design by
>>>>>Lavish,
>>>>> technical architecture and implementation by Cogapp.
>>>>> For regular Cogapp news, please follow us on Twitter
>>>>> <https://twitter.com/cogapp>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>
>> --
>> cogapp
>>
>> Tristan Roddis
>> Head of Web Development
>> +44 1273 821 600
>> www.cogapp.com <http://www.cogapp.com>
>>
>>
>> *News*
>> EnoughFoodIf.org <http://enoughfoodif.org>
>> Digital programme led by a core group of UK charities, design by Lavish,
>> technical architecture and implementation by Cogapp.
>> For regular Cogapp news, please follow us on Twitter <
>> https://twitter.com/cogapp>
>>
>>
>>
>>
>>

Re: Poor performance with OFFSET

Posted by Olivier Rossel <ol...@gmail.com>.

Using LIMIT/OFFSET sounds like paging, to me.
And a usual requirement of paging is to order data before paging.
Ordering can be for example "order alphabetically mmx:Objects by their
label".
So the question is: can I create an index that precomputes "the sorting of
labels of mmx:Objects".

Andy s answer is pretty interesting (and more general).
Tristan, the link you provide sounds interesting too. But I am not sure how
it helps for the creation of the index I described above.




On Tue, Aug 6, 2013 at 4:59 PM, Tristan Roddis <tr...@cogapp.com> wrote:

> On 05/08/2013 21:02, Olivier Rossel wrote:
>
>> Please note that such a query does not sort mmx:Objects in the inner
>> select.
>> I have no idea of the performances drop if you add that requirement.
>>
> Using my data: quite a lot. For a sample offset of 50000, it increases
> from 380ms without, to 1600ms with ordering.
>
>  BTW, is it possible to create an index on (for example) "rdfs:label of
>> mmx:Objects"?
>>
> I'm pretty new to Jena and I'm afraid I'm not sure what you mean! Do you
> mean adding a full-text search index as described at
> http://jena.apache.org/**documentation/query/text-**query.html<http://jena.apache.org/documentation/query/text-query.html>or something else?
>
> Thanks,
>
> -Tristan.
>
>
>  On Mon, Aug 5, 2013 at 6:40 PM, Rob Vesse <rv...@yarcdata.com> wrote:
>
>> Note that OFFSET is implemented internally by having to iterate over all
>>> the unnecessary solutions first.
>>>
>>> So if you have OFFSET 10000 then the query engine has to generate and
>>> discard 10,000 possible solutions before it starts returning actual
>>> solutions.  As OFFSET increases the time this takes increases.
>>>
>>> There is some degree of optimization for these kinds of queries in ARQs
>>> optimizer but it doesn't apply in all cases.
>>>
>>> Rob
>>>
>>>
>>>
>>> On 8/5/13 9:21 AM, "Tristan Roddis" <tr...@cogapp.com> wrote:
>>>
>>>  Thanks, Olivier: that did the trick!
>>>>
>>>> I tweaked it slightly as not all objects have a rdfs:label populated,
>>>> and I need to keep that restriction. But using this query the maximum
>>>> execution time drops to around 700ms:
>>>>
>>>> SELECT DISTINCT *  WHERE {
>>>> { SELECT ?object ?label ?accessionNumber WHERE {?object a mmx:Object .
>>>> ?object  rdfs:label ?label . ?object mmx:accessionNumber
>>>> ?accessionNumber } LIMIT 12 OFFSET 20000 } .
>>>> OPTIONAL {
>>>> ?object mmx:preferredAsset ?asset .
>>>> ?asset a mmx:Asset .
>>>> ?asset dcterms:hasVersion ?image .
>>>> ?image mmx:role 'thumbnail' .
>>>> }
>>>> }
>>>>
>>>> I also now realise that the lack of rdfs:label (and hence results) was
>>>> the reason for the plateau. If I use your query syntax instead, I no
>>>> longer see a plateau at 4000s for invalidly high offsets, but rather a
>>>> speedy (zero result count) response in around 100ms.
>>>>
>>>> Thanks again,
>>>>
>>>> -Tristan.
>>>>
>>>> On 05/08/2013 15:00, Olivier Rossel wrote:
>>>>
>>>>> Could you please try a query like this:
>>>>> SELECT DISTINCT *  WHERE {
>>>>> { SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET 10000 }
>>>>> .
>>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>>> ?object  rdfs:label ?label .
>>>>> OPTIONAL {
>>>>> ?object mmx:preferredAsset ?asset .
>>>>> ?asset a mmx:Asset .
>>>>> ?asset dcterms:hasVersion ?image .
>>>>> ?image mmx:role 'thumbnail' .
>>>>> }
>>>>> }
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <tr...@cogapp.com>
>>>>> wrote:
>>>>>
>>>>>  We have a data set with around 160,000 triples corresponding to an
>>>>>> "mmx:Object", and I am executing the following query to select data
>>>>>> from
>>>>>> these at various offsets:
>>>>>>
>>>>>> SELECT DISTINCT *  WHERE {
>>>>>> ?object a mmx:Object .
>>>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>>>> ?object  rdfs:label ?label .
>>>>>> OPTIONAL {
>>>>>> ?object mmx:preferredAsset ?asset .
>>>>>> ?asset a mmx:Asset .
>>>>>> ?asset dcterms:hasVersion ?image .
>>>>>> ?image mmx:role 'thumbnail' .
>>>>>> }
>>>>>> } LIMIT 12 OFFSET 10000
>>>>>>
>>>>>> However, what I have found is that as I increase the value of this
>>>>>> offset,
>>>>>> the execution time goes up dramatically: from tens of milliseconds
>>>>>> when the
>>>>>> offset is around 1000, to over 4,000 milliseconds if the offset is
>>>>>> increased to 150,000. (see sample data below, or a graph at
>>>>>>
>>>>>> http://picpaste.com/jena_****offsets.png<http://picpaste.com/jena_**offsets.png>
>>>>>> <http://picpaste.**com/jena_offsets<http://picpaste.com/jena_offsets>
>>>>>>
>>>>> .
>>>
>>>> png>)
>>>>>>
>>>>>> I tried generating a stats.opt file, and also using none.opt but this
>>>>>> behaviour seems to happen regardless: the execution time becomes
>>>>>> extremely
>>>>>> high very quickly, and then plateaus at offsets greater than 100,000.
>>>>>>
>>>>>> Does anyone know any way to get better performance from Jena for
>>>>>> offset
>>>>>> queries such as this?
>>>>>>
>>>>>> Thanks in advance,
>>>>>>
>>>>>> -Tristan.
>>>>>>
>>>>>> Sample data below. Results obtained using Apache Benchmark on another
>>>>>> machine on the same network segment. Dev server is a Linux machine
>>>>>> with 3Gb
>>>>>> total RAM, of which 2Gb allocated to the JVM. Production server is a
>>>>>> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to the
>>>>>> JVM.
>>>>>>
>>>>>> Offset  Dev (ms)
>>>>>>           Prod (ms)
>>>>>> 10      8       11
>>>>>> 100     15      17
>>>>>> 1000    84      90
>>>>>> 5000    260     460
>>>>>> 10000   463     701
>>>>>> 20000   811     1070
>>>>>> 30000   1402    1502
>>>>>> 40000   1844    2122
>>>>>> 50000   2515    2695
>>>>>> 60000   2960    3290
>>>>>> 80000   3686    4190
>>>>>> 100000  3840    4470
>>>>>> 120000  3991    4416
>>>>>> 150000  3865    4535
>>>>>>
>>>>>>
>>>>>> --
>>>>>> cogapp
>>>>>>
>>>>>> Tristan Roddis
>>>>>> Head of Web Development
>>>>>> +44 1273 821 600
>>>>>> www.cogapp.com <http://www.cogapp.com>
>>>>>>
>>>>>>
>>>>>> *News*
>>>>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>>>>> Digital programme led by a core group of UK charities, design by
>>>>>> Lavish,
>>>>>> technical architecture and implementation by Cogapp.
>>>>>> For regular Cogapp news, please follow us on Twitter <
>>>>>> https://twitter.com/cogapp>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>> --
>>>> cogapp
>>>>
>>>> Tristan Roddis
>>>> Head of Web Development
>>>> +44 1273 821 600
>>>> www.cogapp.com <http://www.cogapp.com>
>>>>
>>>>
>>>> *News*
>>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>>> Digital programme led by a core group of UK charities, design by Lavish,
>>>> technical architecture and implementation by Cogapp.
>>>> For regular Cogapp news, please follow us on Twitter
>>>> <https://twitter.com/cogapp>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>
> --
> cogapp
>
> Tristan Roddis
> Head of Web Development
> +44 1273 821 600
> www.cogapp.com <http://www.cogapp.com>
>
>
> *News*
> EnoughFoodIf.org <http://enoughfoodif.org>
> Digital programme led by a core group of UK charities, design by Lavish,
> technical architecture and implementation by Cogapp.
> For regular Cogapp news, please follow us on Twitter <
> https://twitter.com/cogapp>
>
>
>
>
>

Re: Poor performance with OFFSET

Posted by Tristan Roddis <tr...@cogapp.com>.

On 05/08/2013 21:02, Olivier Rossel wrote:
> Please note that such a query does not sort mmx:Objects in the inner select.
> I have no idea of the performances drop if you add that requirement.
Using my data: quite a lot. For a sample offset of 50000, it increases 
from 380ms without, to 1600ms with ordering.
> BTW, is it possible to create an index on (for example) "rdfs:label of
> mmx:Objects"?
I'm pretty new to Jena and I'm afraid I'm not sure what you mean! Do you 
mean adding a full-text search index as described at 
http://jena.apache.org/documentation/query/text-query.html or something 
else?

Thanks,

-Tristan.

  On Mon, Aug 5, 2013 at 6:40 PM, Rob Vesse <rv...@yarcdata.com> wrote:
>> Note that OFFSET is implemented internally by having to iterate over all
>> the unnecessary solutions first.
>>
>> So if you have OFFSET 10000 then the query engine has to generate and
>> discard 10,000 possible solutions before it starts returning actual
>> solutions.  As OFFSET increases the time this takes increases.
>>
>> There is some degree of optimization for these kinds of queries in ARQs
>> optimizer but it doesn't apply in all cases.
>>
>> Rob
>>
>>
>>
>> On 8/5/13 9:21 AM, "Tristan Roddis" <tr...@cogapp.com> wrote:
>>
>>> Thanks, Olivier: that did the trick!
>>>
>>> I tweaked it slightly as not all objects have a rdfs:label populated,
>>> and I need to keep that restriction. But using this query the maximum
>>> execution time drops to around 700ms:
>>>
>>> SELECT DISTINCT *  WHERE {
>>> { SELECT ?object ?label ?accessionNumber WHERE {?object a mmx:Object .
>>> ?object  rdfs:label ?label . ?object mmx:accessionNumber
>>> ?accessionNumber } LIMIT 12 OFFSET 20000 } .
>>> OPTIONAL {
>>> ?object mmx:preferredAsset ?asset .
>>> ?asset a mmx:Asset .
>>> ?asset dcterms:hasVersion ?image .
>>> ?image mmx:role 'thumbnail' .
>>> }
>>> }
>>>
>>> I also now realise that the lack of rdfs:label (and hence results) was
>>> the reason for the plateau. If I use your query syntax instead, I no
>>> longer see a plateau at 4000s for invalidly high offsets, but rather a
>>> speedy (zero result count) response in around 100ms.
>>>
>>> Thanks again,
>>>
>>> -Tristan.
>>>
>>> On 05/08/2013 15:00, Olivier Rossel wrote:
>>>> Could you please try a query like this:
>>>> SELECT DISTINCT *  WHERE {
>>>> { SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET 10000 } .
>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>> ?object  rdfs:label ?label .
>>>> OPTIONAL {
>>>> ?object mmx:preferredAsset ?asset .
>>>> ?asset a mmx:Asset .
>>>> ?asset dcterms:hasVersion ?image .
>>>> ?image mmx:role 'thumbnail' .
>>>> }
>>>> }
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <tr...@cogapp.com>
>>>> wrote:
>>>>
>>>>> We have a data set with around 160,000 triples corresponding to an
>>>>> "mmx:Object", and I am executing the following query to select data
>>>>> from
>>>>> these at various offsets:
>>>>>
>>>>> SELECT DISTINCT *  WHERE {
>>>>> ?object a mmx:Object .
>>>>> ?object mmx:accessionNumber ?accessionNumber .
>>>>> ?object  rdfs:label ?label .
>>>>> OPTIONAL {
>>>>> ?object mmx:preferredAsset ?asset .
>>>>> ?asset a mmx:Asset .
>>>>> ?asset dcterms:hasVersion ?image .
>>>>> ?image mmx:role 'thumbnail' .
>>>>> }
>>>>> } LIMIT 12 OFFSET 10000
>>>>>
>>>>> However, what I have found is that as I increase the value of this
>>>>> offset,
>>>>> the execution time goes up dramatically: from tens of milliseconds
>>>>> when the
>>>>> offset is around 1000, to over 4,000 milliseconds if the offset is
>>>>> increased to 150,000. (see sample data below, or a graph at
>>>>>
>>>>> http://picpaste.com/jena_**offsets.png<http://picpaste.com/jena_offsets
>> .
>>>>> png>)
>>>>>
>>>>> I tried generating a stats.opt file, and also using none.opt but this
>>>>> behaviour seems to happen regardless: the execution time becomes
>>>>> extremely
>>>>> high very quickly, and then plateaus at offsets greater than 100,000.
>>>>>
>>>>> Does anyone know any way to get better performance from Jena for offset
>>>>> queries such as this?
>>>>>
>>>>> Thanks in advance,
>>>>>
>>>>> -Tristan.
>>>>>
>>>>> Sample data below. Results obtained using Apache Benchmark on another
>>>>> machine on the same network segment. Dev server is a Linux machine
>>>>> with 3Gb
>>>>> total RAM, of which 2Gb allocated to the JVM. Production server is a
>>>>> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to the
>>>>> JVM.
>>>>>
>>>>> Offset  Dev (ms)
>>>>>           Prod (ms)
>>>>> 10      8       11
>>>>> 100     15      17
>>>>> 1000    84      90
>>>>> 5000    260     460
>>>>> 10000   463     701
>>>>> 20000   811     1070
>>>>> 30000   1402    1502
>>>>> 40000   1844    2122
>>>>> 50000   2515    2695
>>>>> 60000   2960    3290
>>>>> 80000   3686    4190
>>>>> 100000  3840    4470
>>>>> 120000  3991    4416
>>>>> 150000  3865    4535
>>>>>
>>>>>
>>>>> --
>>>>> cogapp
>>>>>
>>>>> Tristan Roddis
>>>>> Head of Web Development
>>>>> +44 1273 821 600
>>>>> www.cogapp.com <http://www.cogapp.com>
>>>>>
>>>>>
>>>>> *News*
>>>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>>>> Digital programme led by a core group of UK charities, design by
>>>>> Lavish,
>>>>> technical architecture and implementation by Cogapp.
>>>>> For regular Cogapp news, please follow us on Twitter <
>>>>> https://twitter.com/cogapp>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>
>>> --
>>> cogapp
>>>
>>> Tristan Roddis
>>> Head of Web Development
>>> +44 1273 821 600
>>> www.cogapp.com <http://www.cogapp.com>
>>>
>>>
>>> *News*
>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>> Digital programme led by a core group of UK charities, design by Lavish,
>>> technical architecture and implementation by Cogapp.
>>> For regular Cogapp news, please follow us on Twitter
>>> <https://twitter.com/cogapp>
>>>
>>>
>>>
>>>
>>


-- 
cogapp

Tristan Roddis
Head of Web Development
+44 1273 821 600
www.cogapp.com <http://www.cogapp.com>


*News*
EnoughFoodIf.org <http://enoughfoodif.org>
Digital programme led by a core group of UK charities, design by Lavish, 
technical architecture and implementation by Cogapp.
For regular Cogapp news, please follow us on Twitter 
<https://twitter.com/cogapp>

Re: Poor performance with OFFSET

Posted by Olivier Rossel <ol...@gmail.com>.

Please note that such a query does not sort mmx:Objects in the inner select.
I have no idea of the performances drop if you add that requirement.
BTW, is it possible to create an index on (for example) "rdfs:label of
mmx:Objects"?


On Mon, Aug 5, 2013 at 6:40 PM, Rob Vesse <rv...@yarcdata.com> wrote:

> Note that OFFSET is implemented internally by having to iterate over all
> the unnecessary solutions first.
>
> So if you have OFFSET 10000 then the query engine has to generate and
> discard 10,000 possible solutions before it starts returning actual
> solutions.  As OFFSET increases the time this takes increases.
>
> There is some degree of optimization for these kinds of queries in ARQs
> optimizer but it doesn't apply in all cases.
>
> Rob
>
>
>
> On 8/5/13 9:21 AM, "Tristan Roddis" <tr...@cogapp.com> wrote:
>
> >Thanks, Olivier: that did the trick!
> >
> >I tweaked it slightly as not all objects have a rdfs:label populated,
> >and I need to keep that restriction. But using this query the maximum
> >execution time drops to around 700ms:
> >
> >SELECT DISTINCT *  WHERE {
> >{ SELECT ?object ?label ?accessionNumber WHERE {?object a mmx:Object .
> >?object  rdfs:label ?label . ?object mmx:accessionNumber
> >?accessionNumber } LIMIT 12 OFFSET 20000 } .
> >OPTIONAL {
> >?object mmx:preferredAsset ?asset .
> >?asset a mmx:Asset .
> >?asset dcterms:hasVersion ?image .
> >?image mmx:role 'thumbnail' .
> >}
> >}
> >
> >I also now realise that the lack of rdfs:label (and hence results) was
> >the reason for the plateau. If I use your query syntax instead, I no
> >longer see a plateau at 4000s for invalidly high offsets, but rather a
> >speedy (zero result count) response in around 100ms.
> >
> >Thanks again,
> >
> >-Tristan.
> >
> >On 05/08/2013 15:00, Olivier Rossel wrote:
> >> Could you please try a query like this:
> >> SELECT DISTINCT *  WHERE {
> >> { SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET 10000 } .
> >> ?object mmx:accessionNumber ?accessionNumber .
> >> ?object  rdfs:label ?label .
> >> OPTIONAL {
> >> ?object mmx:preferredAsset ?asset .
> >> ?asset a mmx:Asset .
> >> ?asset dcterms:hasVersion ?image .
> >> ?image mmx:role 'thumbnail' .
> >> }
> >> }
> >>
> >>
> >>
> >>
> >>
> >> On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <tr...@cogapp.com>
> >>wrote:
> >>
> >>> We have a data set with around 160,000 triples corresponding to an
> >>> "mmx:Object", and I am executing the following query to select data
> >>>from
> >>> these at various offsets:
> >>>
> >>> SELECT DISTINCT *  WHERE {
> >>> ?object a mmx:Object .
> >>> ?object mmx:accessionNumber ?accessionNumber .
> >>> ?object  rdfs:label ?label .
> >>> OPTIONAL {
> >>> ?object mmx:preferredAsset ?asset .
> >>> ?asset a mmx:Asset .
> >>> ?asset dcterms:hasVersion ?image .
> >>> ?image mmx:role 'thumbnail' .
> >>> }
> >>> } LIMIT 12 OFFSET 10000
> >>>
> >>> However, what I have found is that as I increase the value of this
> >>>offset,
> >>> the execution time goes up dramatically: from tens of milliseconds
> >>>when the
> >>> offset is around 1000, to over 4,000 milliseconds if the offset is
> >>> increased to 150,000. (see sample data below, or a graph at
> >>>
> >>>http://picpaste.com/jena_**offsets.png<http://picpaste.com/jena_offsets
> .
> >>>png>)
> >>>
> >>> I tried generating a stats.opt file, and also using none.opt but this
> >>> behaviour seems to happen regardless: the execution time becomes
> >>>extremely
> >>> high very quickly, and then plateaus at offsets greater than 100,000.
> >>>
> >>> Does anyone know any way to get better performance from Jena for offset
> >>> queries such as this?
> >>>
> >>> Thanks in advance,
> >>>
> >>> -Tristan.
> >>>
> >>> Sample data below. Results obtained using Apache Benchmark on another
> >>> machine on the same network segment. Dev server is a Linux machine
> >>>with 3Gb
> >>> total RAM, of which 2Gb allocated to the JVM. Production server is a
> >>> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to the
> >>>JVM.
> >>>
> >>> Offset  Dev (ms)
> >>>          Prod (ms)
> >>> 10      8       11
> >>> 100     15      17
> >>> 1000    84      90
> >>> 5000    260     460
> >>> 10000   463     701
> >>> 20000   811     1070
> >>> 30000   1402    1502
> >>> 40000   1844    2122
> >>> 50000   2515    2695
> >>> 60000   2960    3290
> >>> 80000   3686    4190
> >>> 100000  3840    4470
> >>> 120000  3991    4416
> >>> 150000  3865    4535
> >>>
> >>>
> >>> --
> >>> cogapp
> >>>
> >>> Tristan Roddis
> >>> Head of Web Development
> >>> +44 1273 821 600
> >>> www.cogapp.com <http://www.cogapp.com>
> >>>
> >>>
> >>> *News*
> >>> EnoughFoodIf.org <http://enoughfoodif.org>
> >>> Digital programme led by a core group of UK charities, design by
> >>>Lavish,
> >>> technical architecture and implementation by Cogapp.
> >>> For regular Cogapp news, please follow us on Twitter <
> >>> https://twitter.com/cogapp>
> >>>
> >>>
> >>>
> >>>
> >>>
> >
> >
> >--
> >cogapp
> >
> >Tristan Roddis
> >Head of Web Development
> >+44 1273 821 600
> >www.cogapp.com <http://www.cogapp.com>
> >
> >
> >*News*
> >EnoughFoodIf.org <http://enoughfoodif.org>
> >Digital programme led by a core group of UK charities, design by Lavish,
> >technical architecture and implementation by Cogapp.
> >For regular Cogapp news, please follow us on Twitter
> ><https://twitter.com/cogapp>
> >
> >
> >
> >
>
>

Re: Poor performance with OFFSET

Posted by Rob Vesse <rv...@yarcdata.com>.

Note that OFFSET is implemented internally by having to iterate over all
the unnecessary solutions first.

So if you have OFFSET 10000 then the query engine has to generate and
discard 10,000 possible solutions before it starts returning actual
solutions.  As OFFSET increases the time this takes increases.

There is some degree of optimization for these kinds of queries in ARQs
optimizer but it doesn't apply in all cases.

Rob



On 8/5/13 9:21 AM, "Tristan Roddis" <tr...@cogapp.com> wrote:

>Thanks, Olivier: that did the trick!
>
>I tweaked it slightly as not all objects have a rdfs:label populated,
>and I need to keep that restriction. But using this query the maximum
>execution time drops to around 700ms:
>
>SELECT DISTINCT *  WHERE {
>{ SELECT ?object ?label ?accessionNumber WHERE {?object a mmx:Object .
>?object  rdfs:label ?label . ?object mmx:accessionNumber
>?accessionNumber } LIMIT 12 OFFSET 20000 } .
>OPTIONAL {
>?object mmx:preferredAsset ?asset .
>?asset a mmx:Asset .
>?asset dcterms:hasVersion ?image .
>?image mmx:role 'thumbnail' .
>}
>}
>
>I also now realise that the lack of rdfs:label (and hence results) was
>the reason for the plateau. If I use your query syntax instead, I no
>longer see a plateau at 4000s for invalidly high offsets, but rather a
>speedy (zero result count) response in around 100ms.
>
>Thanks again,
>
>-Tristan.
>
>On 05/08/2013 15:00, Olivier Rossel wrote:
>> Could you please try a query like this:
>> SELECT DISTINCT *  WHERE {
>> { SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET 10000 } .
>> ?object mmx:accessionNumber ?accessionNumber .
>> ?object  rdfs:label ?label .
>> OPTIONAL {
>> ?object mmx:preferredAsset ?asset .
>> ?asset a mmx:Asset .
>> ?asset dcterms:hasVersion ?image .
>> ?image mmx:role 'thumbnail' .
>> }
>> }
>>
>>
>>
>>
>>
>> On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <tr...@cogapp.com>
>>wrote:
>>
>>> We have a data set with around 160,000 triples corresponding to an
>>> "mmx:Object", and I am executing the following query to select data
>>>from
>>> these at various offsets:
>>>
>>> SELECT DISTINCT *  WHERE {
>>> ?object a mmx:Object .
>>> ?object mmx:accessionNumber ?accessionNumber .
>>> ?object  rdfs:label ?label .
>>> OPTIONAL {
>>> ?object mmx:preferredAsset ?asset .
>>> ?asset a mmx:Asset .
>>> ?asset dcterms:hasVersion ?image .
>>> ?image mmx:role 'thumbnail' .
>>> }
>>> } LIMIT 12 OFFSET 10000
>>>
>>> However, what I have found is that as I increase the value of this
>>>offset,
>>> the execution time goes up dramatically: from tens of milliseconds
>>>when the
>>> offset is around 1000, to over 4,000 milliseconds if the offset is
>>> increased to 150,000. (see sample data below, or a graph at
>>> 
>>>http://picpaste.com/jena_**offsets.png<http://picpaste.com/jena_offsets.
>>>png>)
>>>
>>> I tried generating a stats.opt file, and also using none.opt but this
>>> behaviour seems to happen regardless: the execution time becomes
>>>extremely
>>> high very quickly, and then plateaus at offsets greater than 100,000.
>>>
>>> Does anyone know any way to get better performance from Jena for offset
>>> queries such as this?
>>>
>>> Thanks in advance,
>>>
>>> -Tristan.
>>>
>>> Sample data below. Results obtained using Apache Benchmark on another
>>> machine on the same network segment. Dev server is a Linux machine
>>>with 3Gb
>>> total RAM, of which 2Gb allocated to the JVM. Production server is a
>>> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to the
>>>JVM.
>>>
>>> Offset  Dev (ms)
>>>          Prod (ms)
>>> 10      8       11
>>> 100     15      17
>>> 1000    84      90
>>> 5000    260     460
>>> 10000   463     701
>>> 20000   811     1070
>>> 30000   1402    1502
>>> 40000   1844    2122
>>> 50000   2515    2695
>>> 60000   2960    3290
>>> 80000   3686    4190
>>> 100000  3840    4470
>>> 120000  3991    4416
>>> 150000  3865    4535
>>>
>>>
>>> --
>>> cogapp
>>>
>>> Tristan Roddis
>>> Head of Web Development
>>> +44 1273 821 600
>>> www.cogapp.com <http://www.cogapp.com>
>>>
>>>
>>> *News*
>>> EnoughFoodIf.org <http://enoughfoodif.org>
>>> Digital programme led by a core group of UK charities, design by
>>>Lavish,
>>> technical architecture and implementation by Cogapp.
>>> For regular Cogapp news, please follow us on Twitter <
>>> https://twitter.com/cogapp>
>>>
>>>
>>>
>>>
>>>
>
>
>-- 
>cogapp
>
>Tristan Roddis
>Head of Web Development
>+44 1273 821 600
>www.cogapp.com <http://www.cogapp.com>
>
>
>*News*
>EnoughFoodIf.org <http://enoughfoodif.org>
>Digital programme led by a core group of UK charities, design by Lavish,
>technical architecture and implementation by Cogapp.
>For regular Cogapp news, please follow us on Twitter
><https://twitter.com/cogapp>
>
>
>
>

Re: Poor performance with OFFSET

Posted by Tristan Roddis <tr...@cogapp.com>.

Thanks, Olivier: that did the trick!

I tweaked it slightly as not all objects have a rdfs:label populated, 
and I need to keep that restriction. But using this query the maximum 
execution time drops to around 700ms:

SELECT DISTINCT *  WHERE {
{ SELECT ?object ?label ?accessionNumber WHERE {?object a mmx:Object . 
?object  rdfs:label ?label . ?object mmx:accessionNumber 
?accessionNumber } LIMIT 12 OFFSET 20000 } .
OPTIONAL {
?object mmx:preferredAsset ?asset .
?asset a mmx:Asset .
?asset dcterms:hasVersion ?image .
?image mmx:role 'thumbnail' .
}
}

I also now realise that the lack of rdfs:label (and hence results) was 
the reason for the plateau. If I use your query syntax instead, I no 
longer see a plateau at 4000s for invalidly high offsets, but rather a 
speedy (zero result count) response in around 100ms.

Thanks again,

-Tristan.

On 05/08/2013 15:00, Olivier Rossel wrote:
> Could you please try a query like this:
> SELECT DISTINCT *  WHERE {
> { SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET 10000 } .
> ?object mmx:accessionNumber ?accessionNumber .
> ?object  rdfs:label ?label .
> OPTIONAL {
> ?object mmx:preferredAsset ?asset .
> ?asset a mmx:Asset .
> ?asset dcterms:hasVersion ?image .
> ?image mmx:role 'thumbnail' .
> }
> }
>
>
>
>
>
> On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <tr...@cogapp.com> wrote:
>
>> We have a data set with around 160,000 triples corresponding to an
>> "mmx:Object", and I am executing the following query to select data from
>> these at various offsets:
>>
>> SELECT DISTINCT *  WHERE {
>> ?object a mmx:Object .
>> ?object mmx:accessionNumber ?accessionNumber .
>> ?object  rdfs:label ?label .
>> OPTIONAL {
>> ?object mmx:preferredAsset ?asset .
>> ?asset a mmx:Asset .
>> ?asset dcterms:hasVersion ?image .
>> ?image mmx:role 'thumbnail' .
>> }
>> } LIMIT 12 OFFSET 10000
>>
>> However, what I have found is that as I increase the value of this offset,
>> the execution time goes up dramatically: from tens of milliseconds when the
>> offset is around 1000, to over 4,000 milliseconds if the offset is
>> increased to 150,000. (see sample data below, or a graph at
>> http://picpaste.com/jena_**offsets.png<http://picpaste.com/jena_offsets.png>)
>>
>> I tried generating a stats.opt file, and also using none.opt but this
>> behaviour seems to happen regardless: the execution time becomes extremely
>> high very quickly, and then plateaus at offsets greater than 100,000.
>>
>> Does anyone know any way to get better performance from Jena for offset
>> queries such as this?
>>
>> Thanks in advance,
>>
>> -Tristan.
>>
>> Sample data below. Results obtained using Apache Benchmark on another
>> machine on the same network segment. Dev server is a Linux machine with 3Gb
>> total RAM, of which 2Gb allocated to the JVM. Production server is a
>> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to the JVM.
>>
>> Offset  Dev (ms)
>>          Prod (ms)
>> 10      8       11
>> 100     15      17
>> 1000    84      90
>> 5000    260     460
>> 10000   463     701
>> 20000   811     1070
>> 30000   1402    1502
>> 40000   1844    2122
>> 50000   2515    2695
>> 60000   2960    3290
>> 80000   3686    4190
>> 100000  3840    4470
>> 120000  3991    4416
>> 150000  3865    4535
>>
>>
>> --
>> cogapp
>>
>> Tristan Roddis
>> Head of Web Development
>> +44 1273 821 600
>> www.cogapp.com <http://www.cogapp.com>
>>
>>
>> *News*
>> EnoughFoodIf.org <http://enoughfoodif.org>
>> Digital programme led by a core group of UK charities, design by Lavish,
>> technical architecture and implementation by Cogapp.
>> For regular Cogapp news, please follow us on Twitter <
>> https://twitter.com/cogapp>
>>
>>
>>
>>
>>


-- 
cogapp

Tristan Roddis
Head of Web Development
+44 1273 821 600
www.cogapp.com <http://www.cogapp.com>


*News*
EnoughFoodIf.org <http://enoughfoodif.org>
Digital programme led by a core group of UK charities, design by Lavish, 
technical architecture and implementation by Cogapp.
For regular Cogapp news, please follow us on Twitter 
<https://twitter.com/cogapp>

Re: Poor performance with OFFSET

Posted by Olivier Rossel <ol...@gmail.com>.

Could you please try a query like this:
SELECT DISTINCT *  WHERE {
{ SELECT ?object WHERE {?object a mmx:Object } LIMIT 12 OFFSET 10000 } .
?object mmx:accessionNumber ?accessionNumber .
?object  rdfs:label ?label .
OPTIONAL {
?object mmx:preferredAsset ?asset .
?asset a mmx:Asset .
?asset dcterms:hasVersion ?image .
?image mmx:role 'thumbnail' .
}
}





On Mon, Aug 5, 2013 at 3:47 PM, Tristan Roddis <tr...@cogapp.com> wrote:

> We have a data set with around 160,000 triples corresponding to an
> "mmx:Object", and I am executing the following query to select data from
> these at various offsets:
>
> SELECT DISTINCT *  WHERE {
> ?object a mmx:Object .
> ?object mmx:accessionNumber ?accessionNumber .
> ?object  rdfs:label ?label .
> OPTIONAL {
> ?object mmx:preferredAsset ?asset .
> ?asset a mmx:Asset .
> ?asset dcterms:hasVersion ?image .
> ?image mmx:role 'thumbnail' .
> }
> } LIMIT 12 OFFSET 10000
>
> However, what I have found is that as I increase the value of this offset,
> the execution time goes up dramatically: from tens of milliseconds when the
> offset is around 1000, to over 4,000 milliseconds if the offset is
> increased to 150,000. (see sample data below, or a graph at
> http://picpaste.com/jena_**offsets.png<http://picpaste.com/jena_offsets.png>)
>
> I tried generating a stats.opt file, and also using none.opt but this
> behaviour seems to happen regardless: the execution time becomes extremely
> high very quickly, and then plateaus at offsets greater than 100,000.
>
> Does anyone know any way to get better performance from Jena for offset
> queries such as this?
>
> Thanks in advance,
>
> -Tristan.
>
> Sample data below. Results obtained using Apache Benchmark on another
> machine on the same network segment. Dev server is a Linux machine with 3Gb
> total RAM, of which 2Gb allocated to the JVM. Production server is a
> quad-core Windows server with 16Gb RAM, of which 8 Gb allocated to the JVM.
>
> Offset  Dev (ms)
>         Prod (ms)
> 10      8       11
> 100     15      17
> 1000    84      90
> 5000    260     460
> 10000   463     701
> 20000   811     1070
> 30000   1402    1502
> 40000   1844    2122
> 50000   2515    2695
> 60000   2960    3290
> 80000   3686    4190
> 100000  3840    4470
> 120000  3991    4416
> 150000  3865    4535
>
>
> --
> cogapp
>
> Tristan Roddis
> Head of Web Development
> +44 1273 821 600
> www.cogapp.com <http://www.cogapp.com>
>
>
> *News*
> EnoughFoodIf.org <http://enoughfoodif.org>
> Digital programme led by a core group of UK charities, design by Lavish,
> technical architecture and implementation by Cogapp.
> For regular Cogapp news, please follow us on Twitter <
> https://twitter.com/cogapp>
>
>
>
>
>