
Posted to users@jena.apache.org by Paul Tyson <ph...@sbcglobal.net> on 2016/01/06 17:17:40 UTC

optimizing serialization of results from fuseki

I have a modest (17M triple) dataset, fairly flat graph. I run some
queries selecting nodes with anywhere from 12-20 different property
values.

Result set counts are anywhere from 10,000 to 30,000 nodes. Total
execution times measured at the client are in the 30-40 second range.

The web request begins streaming results immediately, but seems to take
longer than it should (based on the number of results and size of data
transfer). I also notice that the time is roughly linear with the size
of the dataset--halving the dataset size halves the result set and the
execution time. I wouldn't have expected this behavior if all the time
was due to an indexed search.

My question is: is total query time limited by search execution speed,
or by marshaling and serialization of search results? 

I have tried different query patterns, and believe I have the best
queries possible for the use case.

I'm looking for other suggestions to reduce overall execution time. The
performance does not improve drastically going from 4 GB to 8 or 16 GB of
RAM. My test platforms are 64-bit Windows, ranging from a small server
(16 GB RAM, 4 CPUs) to laptops with 4 GB RAM.

Thanks,
--Paul

Re: optimizing serialization of results from fuseki

Posted by Håvard Mikkelsen Ottestad <ha...@acando.no>.

Hi,

Can you post a link to your dataset and queries, if the data isn't
sensitive?

Regards,
Håvard M. Ottestad






Re: optimizing serialization of results from fuseki

Posted by Paul Tyson <ph...@sbcglobal.net>.

Rob, thanks for the comments. Responses inline.

On Thu, 2016-01-07 at 16:52 +0000, Rob Vesse wrote:
> Thoughts inline:
> 
> On 07/01/2016 15:56, "Paul Tyson" <ph...@sbcglobal.net> wrote:
> 
> >Here is an actual query, partially obfuscated. It returns about 18K
> >nodes in 40 seconds, from a dataset of about 17M triples. (The nodes are
> >not necessarily distinct.)
> >
> >The predominant graph structure is like:
> >
> >?node <- ?lsu -> ?detail -> LSUPROPERTYVALUE
> >
> >Thanks for your attention and any suggestions for improvement.
> >
> >prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> >prefix xsd: <http://www.w3.org/2001/XMLSchema#>
> >prefix lsu: <http://rules.example.org/ns/lsu#>
> >prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> >SELECT (count(?node) as ?cnt)
> >WHERE {
> >?detail lsu:source "XYZ".
> >?detail lsu:length-type "Ltype".
> >?detail lsu:max-length-exclusive ?maxe_len;
> >  lsu:max-length-inclusive ?maxi_len;
> >  lsu:min-length-inclusive ?mine_len;
> >  lsu:min-length-exclusive ?mini_len.
> >FILTER (
> >  (?maxe_len = rdf:nil || ?maxe_len > "95"^^xsd:decimal)
> >  && (?maxi_len = rdf:nil || ?maxi_len >= "95"^^xsd:decimal)
> >  && (?mine_len = rdf:nil || ?mine_len < "95"^^xsd:decimal)
> >  && (?mini_len  = rdf:nil || ?mini_len <= "95"^^xsd:decimal)
> >)
> 
> It looks like you create triples of the form <subject> <predicate> rdf:nil
> to indicate non-existent values.  Typically people would simply omit those
> triple(s) and infer the non-existent value by lack of matching (using
> OPTIONAL/EXISTS/NOT EXISTS as desired).  This would probably be a
> significant change to your data model so perhaps not an option for you but
> worth mentioning.

Actually the initial data model did allow missing property triples,
necessitating OPTIONAL clauses in the query. These queries performed
very poorly. When I filled out all the missing values with rdf:nil and
removed OPTIONAL clauses I began seeing performance that was more nearly
acceptable. Our dataset maintenance use cases allow us to ensure that
all properties have values.

> 
> If you were willing to make this change then you'd need to use OPTIONAL
> for things that might not have values present which has other performance
> costs associated with it.  However if you did make this change then TDB
> can potentially push the numeric filter condition down into its index scan
> which would likely outweigh any negative performance impact of using
> OPTIONAL.  As it stands, because all the conditions are logical OR (||)
> conditions, they can't be pushed down directly into TDB index scans AFAIK
> (Andy - please correct me if I'm wrong).
> 
> One possible experiment to try that doesn't require changing the data
> model would be to drop all the ?var = rdf:nil clauses from your various
> FILTER expressions and see what effect on performance that has.
> 
> >?detail lsu:date-type "Date type 1".
> >{{
> >  ?detail lsu:retroactive true;
> >    lsu:end-date rdf:nil .
> >} UNION {
> >  ?detail lsu:retroactive false;
> >    lsu:start-date ?start ;
> >    lsu:end-date ?end .
> >  FILTER (?start <= "2006-08-11"^^xsd:date
> >  && (?end = rdf:nil || ?end >= "2006-08-11"^^xsd:date))
> >}}
> >?detail lsu:minimum-age ?min_age;
> >  lsu:maximum-age ?max_age.
> >FILTER ((?max_age = rdf:nil || ?max_age >= 8)
> > && (?min_age = 0 || ?min_age < 8))
> >?detail lsu:applicable-for "adfsda" .
> >?detail lsu:v-type ?v_type.
> >FILTER (?v_type in (rdf:nil, <http://www.example.org/2015/7/abc>))
> 
> Using the IN syntax can be quite expensive because a store potentially has
> to pick out a large set of candidate matches and then filter them. For a
> small number of values, using a UNION with one of the constants
> substituted into each branch may offer better performance, though I'm not
> sure how much it will help here.  Note that Jena performs this optimization
> anyway, so it probably makes little difference to your query.
> 
> >?detail lsu:s-type ?s_type.
> >FILTER (?s_type in (rdf:nil, <http://www.example.org/2015/7/dsfgdsa>))
> >?detail lsu:max-gg-exclusive ?maxe_gg;
> >  lsu:max-gg-inclusive ?maxi_gg;
> >  lsu:min-gg-inclusive ?mine_gg;
> >  lsu:min-gg-exclusive ?mini_gg.
> >FILTER (
> >  (?maxe_gg = rdf:nil || ?maxe_gg > "50"^^xsd:decimal)
> >  && (?maxi_gg = rdf:nil || ?maxi_gg >= "50"^^xsd:decimal)
> >  && (?mine_gg = rdf:nil || ?mine_gg < "50"^^xsd:decimal)
> >  && (?mini_gg = rdf:nil || ?mini_gg <= "50"^^xsd:decimal)
> >)
> 
> Again same point about use of rdf:nil
> 
> >?detail lsu:h-m ?h_m.
> >FILTER (?h_m in (rdf:nil, <http://www.example.org/2015/7/hm1>))
> >{{
> >?detail lsu:v-func ?v_func.
> >FILTER (?v_func in
> >(<http://www.example.org/2015/7/vf1>,<http://www.example.org/2015/7/vf2>))
> >} UNION {
> >?detail lsu:c-n ?c_n.
> >FILTER (?c_n in
> >(<http://www.example.org/2015/7/cn1>,<http://www.example.org/2015/7/cn2>,<
> >http://www.example.org/2015/7/cn3>,<http://www.example.org/2015/7/cn4>))
> >}}
> >?lsu lsu:lsu-d ?detail.
> >?lsu lsu:aF ?node.
> >}
> 
> In general I'm not sure I see huge room for improvement without knowing
> more about the statistics of the data.

That's about what I have concluded. Even knowing the statistics, it
would be difficult to optimize in general because the filter criteria
for a property could be highly selective or not at all depending on the
specific values passed in the query.

I did see a modest gain by moving the constant-valued triples to the end
of the pattern, but this is still not enough to be useful. Most of these
are not highly selective anyway so should be last.

> If a later part of the query is
> likely to match a smaller subset of the data than an earlier part then you
> can try moving the pieces of the query around.  You have a lot of
> value-based filters, which tend to be hard to optimise in general, and as
> noted their use as arguments to || expressions probably blocks TDB from
> pushing those down directly into the index scans.
> 
> The algebra for the query looks reasonable, other than being large because
> the query is large; all the FILTER expressions look like they get pushed
> down and evaluated as soon as reasonably possible.
> 
> Rob
> 

Re: optimizing serialization of results from fuseki

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Thoughts inline:

On 07/01/2016 15:56, "Paul Tyson" <ph...@sbcglobal.net> wrote:

>Here is an actual query, partially obfuscated. It returns about 18K
>nodes in 40 seconds, from a dataset of about 17M triples. (The nodes are
>not necessarily distinct.)
>
>The predominant graph structure is like:
>
>?node <- ?lsu -> ?detail -> LSUPROPERTYVALUE
>
>Thanks for your attention and any suggestions for improvement.
>
>prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>prefix xsd: <http://www.w3.org/2001/XMLSchema#>
>prefix lsu: <http://rules.example.org/ns/lsu#>
>prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>SELECT (count(?node) as ?cnt)
>WHERE {
>?detail lsu:source "XYZ".
>?detail lsu:length-type "Ltype".
>?detail lsu:max-length-exclusive ?maxe_len;
>  lsu:max-length-inclusive ?maxi_len;
>  lsu:min-length-inclusive ?mine_len;
>  lsu:min-length-exclusive ?mini_len.
>FILTER (
>  (?maxe_len = rdf:nil || ?maxe_len > "95"^^xsd:decimal)
>  && (?maxi_len = rdf:nil || ?maxi_len >= "95"^^xsd:decimal)
>  && (?mine_len = rdf:nil || ?mine_len < "95"^^xsd:decimal)
>  && (?mini_len  = rdf:nil || ?mini_len <= "95"^^xsd:decimal)
>)

It looks like you create triples of the form <subject> <predicate> rdf:nil
to indicate non-existent values.  Typically people would simply omit those
triple(s) and infer the non-existent value by lack of matching (using
OPTIONAL/EXISTS/NOT EXISTS as desired).  This would probably be a
significant change to your data model so perhaps not an option for you but
worth mentioning.

If you were willing to make this change then you'd need to use OPTIONAL
for things that might not have values present which has other performance
costs associated with it.  However if you did make this change then TDB
can potentially push the numeric filter condition down into its index scan
which would likely outweigh any negative performance impact of using
OPTIONAL.  As it stands, because all the conditions are logical OR (||)
conditions, they can't be pushed down directly into TDB index scans AFAIK
(Andy - please correct me if I'm wrong).
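
For illustration, here is a minimal sketch of how two of the range checks
might look if the rdf:nil placeholder triples were left out of the data. It
uses FILTER NOT EXISTS (one of the options listed above) rather than
OPTIONAL, and it assumes each ?detail has at most one value per limit
property. It is untested against the real data, and whether TDB evaluates it
faster than the original FILTER is something to measure, not assume.

```sparql
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX lsu: <http://rules.example.org/ns/lsu#>

SELECT (count(?detail) AS ?cnt)
WHERE {
  ?detail lsu:source "XYZ" ;
          lsu:length-type "Ltype" .

  # Assumption: a missing max-length-exclusive triple means "no limit".
  # Reject ?detail only when a limit exists and 95 is not strictly below it.
  FILTER NOT EXISTS {
    ?detail lsu:max-length-exclusive ?maxe_len .
    FILTER (?maxe_len <= "95"^^xsd:decimal)
  }

  # Same idea for the inclusive upper limit: reject when the limit is below 95.
  FILTER NOT EXISTS {
    ?detail lsu:max-length-inclusive ?maxi_len .
    FILTER (?maxi_len < "95"^^xsd:decimal)
  }
}
```

Each inner FILTER here is a plain numeric comparison with no || branch,
which is the shape the paragraph above suggests is easier to optimise.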

One possible experiment to try that doesn't require changing the data
model would be to drop all the ?var = rdf:nil clauses from your various
FILTER expressions and see what effect on performance that has.

>?detail lsu:date-type "Date type 1".
>{{
>  ?detail lsu:retroactive true;
>    lsu:end-date rdf:nil .
>} UNION {
>  ?detail lsu:retroactive false;
>    lsu:start-date ?start ;
>    lsu:end-date ?end .
>  FILTER (?start <= "2006-08-11"^^xsd:date
>  && (?end = rdf:nil || ?end >= "2006-08-11"^^xsd:date))
>}}
>?detail lsu:minimum-age ?min_age;
>  lsu:maximum-age ?max_age.
>FILTER ((?max_age = rdf:nil || ?max_age >= 8)
> && (?min_age = 0 || ?min_age < 8))
>?detail lsu:applicable-for "adfsda" .
>?detail lsu:v-type ?v_type.
>FILTER (?v_type in (rdf:nil, <http://www.example.org/2015/7/abc>))

Using the IN syntax can be quite expensive because a store potentially has
to pick out a large set of candidate matches and then filter them. For a
small number of values, using a UNION with one of the constants
substituted into each branch may offer better performance, though I'm not
sure how much it will help here.  Note that Jena performs this optimization
anyway, so it probably makes little difference to your query.
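
As a sketch of the UNION form described above, the lsu:v-type test could be
written with one constant per branch. This is only an illustration of the
rewrite; the rest of the original query is omitted, and the optimizer may
already produce an equivalent plan from the IN form.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX lsu: <http://rules.example.org/ns/lsu#>

SELECT (count(?detail) AS ?cnt)
WHERE {
  ?detail lsu:applicable-for "adfsda" .

  # Replaces:
  #   ?detail lsu:v-type ?v_type .
  #   FILTER (?v_type in (rdf:nil, <http://www.example.org/2015/7/abc>))
  { ?detail lsu:v-type rdf:nil }
  UNION
  { ?detail lsu:v-type <http://www.example.org/2015/7/abc> }
}
```

A SPARQL 1.1 VALUES block listing the two constants is another way to
express the same restriction.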

>?detail lsu:s-type ?s_type.
>FILTER (?s_type in (rdf:nil, <http://www.example.org/2015/7/dsfgdsa>))
>?detail lsu:max-gg-exclusive ?maxe_gg;
>  lsu:max-gg-inclusive ?maxi_gg;
>  lsu:min-gg-inclusive ?mine_gg;
>  lsu:min-gg-exclusive ?mini_gg.
>FILTER (
>  (?maxe_gg = rdf:nil || ?maxe_gg > "50"^^xsd:decimal)
>  && (?maxi_gg = rdf:nil || ?maxi_gg >= "50"^^xsd:decimal)
>  && (?mine_gg = rdf:nil || ?mine_gg < "50"^^xsd:decimal)
>  && (?mini_gg = rdf:nil || ?mini_gg <= "50"^^xsd:decimal)
>)

Again same point about use of rdf:nil

>?detail lsu:h-m ?h_m.
>FILTER (?h_m in (rdf:nil, <http://www.example.org/2015/7/hm1>))
>{{
>?detail lsu:v-func ?v_func.
>FILTER (?v_func in
>(<http://www.example.org/2015/7/vf1>,<http://www.example.org/2015/7/vf2>))
>} UNION {
>?detail lsu:c-n ?c_n.
>FILTER (?c_n in
>(<http://www.example.org/2015/7/cn1>,<http://www.example.org/2015/7/cn2>,<
>http://www.example.org/2015/7/cn3>,<http://www.example.org/2015/7/cn4>))
>}}
>?lsu lsu:lsu-d ?detail.
>?lsu lsu:aF ?node.
>}

In general I'm not sure I see huge room for improvement without knowing
more about the statistics of the data.  If a later part of the query is
likely to match a smaller subset of the data than an earlier part then you
can try moving the pieces of the query around.  You have a lot of
value-based filters, which tend to be hard to optimise in general, and as
noted their use as arguments to || expressions probably blocks TDB from
pushing those down directly into the index scans.
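
One way to get a rough view of those statistics is to count how many
?detail nodes carry each value of a constant-valued property; patterns that
match few nodes are candidates for moving earlier. The hypothetical
diagnostic query below does this for lsu:source, and the same shape could be
repeated for lsu:length-type, lsu:date-type and the other properties.

```sparql
PREFIX lsu: <http://rules.example.org/ns/lsu#>

# How many ?detail nodes does each lsu:source value select?
SELECT ?source (count(?detail) AS ?n)
WHERE {
  ?detail lsu:source ?source .
}
GROUP BY ?source
ORDER BY DESC(?n)
```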

The algebra for the query looks reasonable, other than being large because
the query is large; all the FILTER expressions look like they get pushed
down and evaluated as soon as reasonably possible.

Rob


Re: optimizing serialization of results from fuseki

Posted by Paul Tyson <ph...@sbcglobal.net>.

Here is an actual query, partially obfuscated. It returns about 18K
nodes in 40 seconds, from a dataset of about 17M triples. (The nodes are
not necessarily distinct.)

The predominant graph structure is like:

?node <- ?lsu -> ?detail -> LSUPROPERTYVALUE

Thanks for your attention and any suggestions for improvement.

prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#>
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
prefix lsu: <http://rules.example.org/ns/lsu#>
prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT (count(?node) as ?cnt)
WHERE {
?detail lsu:source "XYZ".
?detail lsu:length-type "Ltype".
?detail lsu:max-length-exclusive ?maxe_len;
  lsu:max-length-inclusive ?maxi_len;
  lsu:min-length-inclusive ?mine_len;
  lsu:min-length-exclusive ?mini_len.
FILTER (
  (?maxe_len = rdf:nil || ?maxe_len > "95"^^xsd:decimal)
  && (?maxi_len = rdf:nil || ?maxi_len >= "95"^^xsd:decimal)
  && (?mine_len = rdf:nil || ?mine_len < "95"^^xsd:decimal)
  && (?mini_len  = rdf:nil || ?mini_len <= "95"^^xsd:decimal)
)
?detail lsu:date-type "Date type 1".
{{
  ?detail lsu:retroactive true;
    lsu:end-date rdf:nil .
} UNION {
  ?detail lsu:retroactive false;
    lsu:start-date ?start ;
    lsu:end-date ?end .
  FILTER (?start <= "2006-08-11"^^xsd:date
  && (?end = rdf:nil || ?end >= "2006-08-11"^^xsd:date))
}}
?detail lsu:minimum-age ?min_age;
  lsu:maximum-age ?max_age.
FILTER ((?max_age = rdf:nil || ?max_age >= 8)
 && (?min_age = 0 || ?min_age < 8))
?detail lsu:applicable-for "adfsda" .
?detail lsu:v-type ?v_type.
FILTER (?v_type in (rdf:nil, <http://www.example.org/2015/7/abc>))
?detail lsu:s-type ?s_type.
FILTER (?s_type in (rdf:nil, <http://www.example.org/2015/7/dsfgdsa>))
?detail lsu:max-gg-exclusive ?maxe_gg;
  lsu:max-gg-inclusive ?maxi_gg;
  lsu:min-gg-inclusive ?mine_gg;
  lsu:min-gg-exclusive ?mini_gg.
FILTER (
  (?maxe_gg = rdf:nil || ?maxe_gg > "50"^^xsd:decimal)
  && (?maxi_gg = rdf:nil || ?maxi_gg >= "50"^^xsd:decimal)
  && (?mine_gg = rdf:nil || ?mine_gg < "50"^^xsd:decimal)
  && (?mini_gg = rdf:nil || ?mini_gg <= "50"^^xsd:decimal)
)
?detail lsu:h-m ?h_m.
FILTER (?h_m in (rdf:nil, <http://www.example.org/2015/7/hm1>))
{{
?detail lsu:v-func ?v_func.
FILTER (?v_func in
(<http://www.example.org/2015/7/vf1>,<http://www.example.org/2015/7/vf2>))
} UNION {
?detail lsu:c-n ?c_n.
FILTER (?c_n in
(<http://www.example.org/2015/7/cn1>,<http://www.example.org/2015/7/cn2>,<http://www.example.org/2015/7/cn3>,<http://www.example.org/2015/7/cn4>))
}}
?lsu lsu:lsu-d ?detail.
?lsu lsu:aF ?node.
}


On Thu, 2016-01-07 at 12:36 +0000, Andy Seaborne wrote:
> It looks like it is the query cost and not the
> 
> > So I conclude we are seeing the best performance possible unless there
> > is something terribly wrong with my queries. They are essentially of the
> > form:
> >
> 
> Details matter here - can you show a real query?
> 
> > select ?s
> > where {
> > ?nd :prop1 <uri1>;
> >   :prop2 "lit1";
> >   :prop3 ?var1;
> >   :prop4 ?var2;
> > # more properties of ?s
> 
> ?s doesn't appear until later.
> 
> There is a chance there are cross products in the real query.
> 
> > filter (?var1 > N1 && ?var1 < N2)
> > filter (?var2 in (<uriA>,<uriB>,...))
> 
> This usually gets optimized - maybe something else in your query is 
> blocking that.
> 
> Filter order can matter as well.
> 
> > #more filters on ?nd properties
> > ?s :p1/:p2 ?nd.
> > }
> >
> > Some of the filters get a little more complicated. And there is at least
> > one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
> > queries and run each individual piece (triple + filter), and it seems to
> > be the more complicated filters that start to slow things down, as might
> > be expected.
> >
> > Thanks for your comments and interest. The performance we're seeing is
> > unacceptable for our application requirements, so I wanted to see if
> > there were any other performance factors I had missed.
> 
> 	Andy
> 
> On 07/01/16 08:48, Håvard Mikkelsen Ottestad wrote:
> > Hi,
> >
> > Reordering the filters might help.
> >
> > Also, maybe a stats file would let the optimizer reorder your query to be faster. I dunno how often (or if) Fuseki generates a stats file. You can try to generate one by hand while Fuseki is shut down: https://jena.apache.org/documentation/tdb/optimizer.html
> >
> > Also I’m wondering what the performance is like if you take this line away:
> > ?s :p1/:p2 ?nd.
> >
> >
> > One major performance drain I have seen in the past is filters on string literals, especially if you are doing anything like CONTAINS or LCASE. Do you have any of that?
> >
> > Håvard
> >
> >
> >
> >
> > On 07/01/16 03:51, "Paul Tyson" <ph...@sbcglobal.net> wrote:
> >
> >> On Wed, 2016-01-06 at 18:52 +0000, Andy Seaborne wrote:
> >>> Hi Paul,
> >>>
> >>>   > My question is: is total query time limited by search execution speed,
> >>>   > or by marshaling and serialization of search results?
> >>>
> >>> Costs are a bit of both but normally mainly query.  It also depends on
> >>> the client processing.
> >>>
> >>> Some context please:
> >>> 1/ What's the storage layer?
> >> TDB behind fuseki 2.3.1
> >>
> >>> 2/ What result set format are you getting?
> >> text/csv
> >>
> >>> 3/ How are you handling the results on receipt in the client?
> >> Just writing them to file for testing.
> >>
> >>>
> >>> (Håvard point about seeing data and query also applies)
> >> Sorry, not easy to share the data.
> >>
> >>>
> >>> The important point is that output is streamed.
> >>>
> >>> Result sent while the query is execution; it is not the case that the
> >>> query executes,. all the results calculated and then results are produced.
> >>>
> >>> To investigate, modify the query to do something like this
> >>>
> >>> SELECT (count(*) AS ?C) { ... }
> >>>
> >>> because then the result set cost is low and all the query is executed
> >>> before a result can be produced.
> >>>
> >> Yes, I did that, and the time is very nearly the same.
> >>
> >> So I conclude we are seeing the best performance possible unless there
> >> is something terribly wrong with my queries. They are essentially of the
> >> form:
> >>
> >> select ?s
> >> where {
> >> ?nd :prop1 <uri1>;
> >>   :prop2 "lit1";
> >>   :prop3 ?var1;
> >>   :prop4 ?var2;
> >> # more properties of ?s
> >> filter (?var1 > N1 && ?var1 < N2)
> >> filter (?var2 in (<uriA>,<uriB>,...))
> >> #more filters on ?nd properties
> >> ?s :p1/:p2 ?nd.
> >> }
> >>
> >> Some of the filters get a little more complicated. And there is at least
> >> one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
> >> queries and run each individual piece (triple + filter), and it seems to
> >> be the more complicated filters that start to slow things down, as might
> >> be expected.
> >>
> >> Thanks for your comments and interest. The performance we're seeing is
> >> unacceptable for our application requirements, so I wanted to see if
> >> there were any other performance factors I had missed.
> >>
> >> Regards,
> >> --Paul
> >>
> >>>      Andy
> >>>
> >>>
> >>> On 06/01/16 16:17, Paul Tyson wrote:
> >>>> I have a modest (17M triple) dataset, fairly flat graph. I run some
> >>>> queries selecting nodes with anywhere from 12-20 different property
> >>>> values.
> >>>>
> >>>> Result set counts are anywhere from 10,000 to 30,000 nodes. Total
> >>>> execution time measured at client are in the 30-40 second range.
> >>>>
> >>>> The web request begins streaming results immediately, but seems to take
> >>>> longer than it should (based on the number of results and size of data
> >>>> transfer). I also notice that the time is roughly linear with the size
> >>>> of dataset--halving the dataset size halves the result set and the
> >>>> execution time. I wouldn't have expected this behavior if all the time
> >>>> was due to an indexed search.
> >>>>
> >>>> My question is: is total query time limited by search execution speed,
> >>>> or by marshaling and serialization of search results?
> >>>>
> >>>> I have tried different query patterns, and believe I have the best
> >>>> queries possible for the use case.
> >>>>
> >>>> I'm looking for other suggestions to reduce overall execution time. The
> >>>> performance does not improve drastically going from 4Gb to 8 or 16Gb
> >>>> RAM. My test platforms are 64-bit Windows, ranging from small server
> >>>> (16Gb RAM, 4 CPU) to laptops with 4Gb RAM.
> >>>>
> >>>> Thanks,
> >>>> --Paul
> >>>>
> >>>
> >>
> >>
>

Re: optimizing serialization of results from fuseki

Posted by Andy Seaborne <an...@apache.org>.

It looks like it is the query cost and not the result serialization.

> So I conclude we are seeing the best performance possible unless there
> is something terribly wrong with my queries. They are essentially of the
> form:
>

Details matter here - can you show a real query?

> select ?s
> where {
> ?nd :prop1 <uri1>;
>   :prop2 "lit1";
>   :prop3 ?var1;
>   :prop4 ?var2;
> # more properties of ?s

?s doesn't appear until later.

There is a chance there are cross products in the real query.
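
To illustrate why a cross product is expensive, here is a small Python sketch. It is not Jena code; the join function and the sample solutions are hypothetical and only mimic how two sets of solutions are combined. When the two sides share a variable, the join is selective. When they share none, every pair is compatible and the result size is the product of the two input sizes.

```python
def join(left, right):
    """Nested-loop join of two lists of solution mappings (dicts).

    Two solutions are combined when they agree on every shared variable.
    With no shared variables every pair is compatible, so the result is
    the full cross product.
    """
    results = []
    for a in left:
        for b in right:
            if all(a[v] == b[v] for v in a.keys() & b.keys()):
                results.append({**a, **b})
    return results


# Hypothetical solutions for two parts of a query.
nd_rows = [{"nd": "nd%d" % i} for i in range(100)]
connected = [{"s": "s%d" % i, "nd": "nd%d" % i} for i in range(100)]
unconnected = [{"s": "s%d" % i} for i in range(100)]

print(len(join(nd_rows, connected)))    # 100: joined on ?nd
print(len(join(nd_rows, unconnected)))  # 10000: no shared variable
```

If a pattern that mentions only ?s were evaluated before the pattern that links ?s to ?nd, the intermediate result could grow in the same way.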

> filter (?var1 > N1 && ?var1 < N2)
> filter (?var2 in (<uriA>,<uriB>,...))

This usually gets optimized - maybe something else in your query is 
blocking that.

Filter order can matter as well.

> #more filters on ?nd properties
> ?s :p1/:p2 ?nd.
> }
>
> Some of the filters get a little more complicated. And there is at least
> one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
> queries and run each individual piece (triple + filter), and it seems to
> be the more complicated filters that start to slow things down, as might
> be expected.
>
> Thanks for your comments and interest. The performance we're seeing is
> unacceptable for our application requirements, so I wanted to see if
> there were any other performance factors I had missed.

	Andy


Re: optimizing serialization of results from fuseki

Posted by "A. Soroka" <aj...@virginia.edu>.

I may be very wrong about this (I would appreciate correction if I am), but I don’t think Fuseki or TDB does anything to keep stats.opt updated; that is, if you want it to stay current with updates that have occurred, you need to regenerate it yourself.

If that is correct, then if you generated the dataset with TDB’s bulk loaders, you started with a good stats.opt, but if you did piecemeal updating either directly or through Fuseki, its contents may have gotten out of whack with the data.

https://jena.apache.org/documentation/tdb/optimizer.html
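
For anyone refreshing the file by hand, here is a minimal Python sketch of the procedure as I understand it from that page: run tdbstats against the database directory while Fuseki is stopped, capture the output in a temporary file, and move that file into the database directory as stats.opt. It assumes the Jena command-line scripts are on the PATH. The function names and the directory in the usage comment are hypothetical.

```python
import os
import shutil
import subprocess
import tempfile


def tdbstats_command(db_dir, executable="tdbstats"):
    """Build the command line that prints statistics for a TDB database."""
    return [executable, "--loc=" + db_dir]


def regenerate_stats(db_dir, executable="tdbstats"):
    """Write fresh statistics to <db_dir>/stats.opt.

    Run this only while Fuseki is stopped. The output goes to a temporary
    file outside the database directory first and is moved into place
    only if the command succeeds.
    """
    fd, tmp_path = tempfile.mkstemp(suffix=".opt")
    try:
        with os.fdopen(fd, "w") as out:
            subprocess.run(tdbstats_command(db_dir, executable),
                           stdout=out, check=True)
        shutil.move(tmp_path, os.path.join(db_dir, "stats.opt"))
    finally:
        # Clean up the temporary file if the move did not happen.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)


# Usage (hypothetical path):
# regenerate_stats("/data/tdb/mydataset")
```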

---
A. Soroka
The University of Virginia Library

> On Jan 7, 2016, at 11:06 AM, Paul Tyson <ph...@sbcglobal.net> wrote:
> 
>> Also, maybe a stats file would reorder your query to be faster. I dunno how often (or if) fuseki generates a stats file. You can try to generate one by hand when fuseki is shutdown: https://jena.apache.org/documentation/tdb/optimizer.html
>> 
> There is a stats.opt file so I assume it's using that for optimization.

Re: optimizing serialization of results from fuseki

Posted by Paul Tyson <ph...@sbcglobal.net>.

On Thu, 2016-01-07 at 08:48 +0000, Håvard Mikkelsen Ottestad wrote:
> Hi,
> 
> Reordering the filters might help.
I've tried a bit of this, without much noticeable effect. I understand
that these are pretty well optimized when the SPARQL text is converted
to algebra.

> 
> Also, maybe a stats file would reorder your query to be faster. I dunno how often (or if) fuseki generates a stats file. You can try to generate one by hand when fuseki is shutdown: https://jena.apache.org/documentation/tdb/optimizer.html
> 
There is a stats.opt file so I assume it's using that for optimization.

> Also I’m wondering what the performance is like if you take this line away: 
> ?s :p1/:p2 ?nd.
> 
When this occurs last in the BGP it does not have much effect. At the
top of the BGP it really slowed things down.

> 
> One major performance drain I have seen in the past is filters on string literals. Especially if you are doing anything like CONTAINS or LOWERCASE. Do you have any of that?
> 
I think all my filters are on numeric literals or URIs (see the sample
query in my other post). On other projects I have also noticed the impact
of string filters, and got much better results using the Lucene add-on
for that.

Regards,
--Paul


Re: optimizing serialization of results from fuseki

Posted by Håvard Mikkelsen Ottestad <ha...@acando.no>.

Hi,

Reordering the filters might help.

Also, maybe a stats file would reorder your query to be faster. I dunno how often (or if) Fuseki generates a stats file. You can try to generate one by hand while Fuseki is shut down: https://jena.apache.org/documentation/tdb/optimizer.html

Also I’m wondering what the performance is like if you take this line away: 
?s :p1/:p2 ?nd.


One major performance drain I have seen in the past is filters on string literals, especially if you are doing anything like CONTAINS or LCASE. Do you have any of that?

Håvard




On 07/01/16 03:51, "Paul Tyson" <ph...@sbcglobal.net> wrote:

>On Wed, 2016-01-06 at 18:52 +0000, Andy Seaborne wrote:
>> Hi Paul,
>> 
>>  > My question is: is total query time limited by search execution speed,
>>  > or by marshaling and serialization of search results?
>> 
>> Costs are a bit of both but normally mainly query.  It also depends on 
>> the client processing.
>> 
>> Some context please:
>> 1/ What's the storage layer?
>TDB behind fuseki 2.3.1
>
>> 2/ What result set format are you getting?
>text/csv
>
>> 3/ How are you handling the results on receipt in the client?
>Just writing them to file for testing.
>
>> 
>> (Håvard point about seeing data and query also applies)
>Sorry, not easy to share the data.
>
>> 
>> The important point is that output is streamed.
>> 
>> Results are sent while the query is executing; it is not the case that the 
>> query executes, all the results are calculated, and then the results are produced.
>> 
>> To investigate, modify the query to do something like this
>> 
>> SELECT (count(*) AS ?C) { ... }
>> 
>> because then the result set cost is low and all the query is executed 
>> before a result can be produced.
>> 
>Yes, I did that, and the time is very nearly the same.
>
>So I conclude we are seeing the best performance possible unless there
>is something terribly wrong with my queries. They are essentially of the
>form:
>
>select ?s
>where {
>?nd :prop1 <uri1>;
>  :prop2 "lit1";
>  :prop3 ?var1;
>  :prop4 ?var2;
># more properties of ?s
>filter (?var1 > N1 && ?var1 < N2)
>filter (?var2 in (<uriA>,<uriB>,...))
>#more filters on ?nd properties
>?s :p1/:p2 ?nd.
>}
>
>Some of the filters get a little more complicated. And there is at least
>one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
>queries and run each individual piece (triple + filter), and it seems to
>be the more complicated filters that start to slow things down, as might
>be expected.
>
>Thanks for your comments and interest. The performance we're seeing is
>unacceptable for our application requirements, so I wanted to see if
>there were any other performance factors I had missed.
>
>Regards,
>--Paul

Re: optimizing serialization of results from fuseki

Posted by Paul Tyson <ph...@sbcglobal.net>.

On Wed, 2016-01-06 at 18:52 +0000, Andy Seaborne wrote:
> Hi Paul,
> 
>  > My question is: is total query time limited by search execution speed,
>  > or by marshaling and serialization of search results?
> 
> Costs are a bit of both but normally mainly query.  It also depends on 
> the client processing.
> 
> Some context please:
> 1/ What's the storage layer?
TDB behind fuseki 2.3.1

> 2/ What result set format are you getting?
text/csv

> 3/ How are you handling the results on receipt in the client?
Just writing them to file for testing.

> 
> (Håvard point about seeing data and query also applies)
Sorry, not easy to share the data.

> 
> The important point is that output is streamed.
> 
> Results are sent while the query is executing; it is not the case that the 
> query executes, all the results are calculated, and then the results are produced.
> 
> To investigate, modify the query to do something like this
> 
> SELECT (count(*) AS ?C) { ... }
> 
> because then the result set cost is low and all the query is executed 
> before a result can be produced.
> 
Yes, I did that, and the time is very nearly the same.

So I conclude we are seeing the best performance possible unless there
is something terribly wrong with my queries. They are essentially of the
form:

select ?s
where {
?nd :prop1 <uri1>;
  :prop2 "lit1";
  :prop3 ?var1;
  :prop4 ?var2;
# more properties of ?s
filter (?var1 > N1 && ?var1 < N2)
filter (?var2 in (<uriA>,<uriB>,...))
#more filters on ?nd properties
?s :p1/:p2 ?nd.
}

Some of the filters get a little more complicated. And there is at least
one, possibly 2, UNION clauses. No OPTIONAL clauses. I've dissected the
queries and run each individual piece (triple + filter), and it seems to
be the more complicated filters that start to slow things down, as might
be expected.

Thanks for your comments and interest. The performance we're seeing is
unacceptable for our application requirements, so I wanted to see if
there were any other performance factors I had missed.

Regards,
--Paul


Re: optimizing serialization of results from fuseki

Posted by Andy Seaborne <an...@apache.org>.

Hi Paul,

 > My question is: is total query time limited by search execution speed,
 > or by marshaling and serialization of search results?

Costs are a bit of both but normally mainly query.  It also depends on 
the client processing.

Some context please:
1/ What's the storage layer?
2/ What result set format are you getting?
3/ How are you handling the results on receipt in the client?

(Håvard's point about seeing the data and query also applies)

The important point is that output is streamed.

Results are sent while the query is executing; it is not the case that the 
query executes, all the results are calculated, and then the results are produced.

To investigate, modify the query to do something like this

SELECT (count(*) AS ?C) { ... }

because then the result set cost is low and all the query is executed 
before a result can be produced.
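
A rough way to run that comparison from a client, sketched in Python. The helper names and the endpoint URL in the usage comment are hypothetical, and the rewrite only handles a simple top-level SELECT (no DISTINCT, no subqueries). If the count query takes about as long as the full query, the time is going into query execution rather than into producing the result set.

```python
import re
import time
import urllib.parse
import urllib.request

# Matches "SELECT ..." up to (not including) WHERE or the opening brace.
_SELECT_CLAUSE = re.compile(r"\bselect\b.*?(?=\bwhere\b|\{)",
                            re.IGNORECASE | re.DOTALL)


def to_count_query(query):
    """Replace the projection of a simple SELECT query with a row count."""
    rewritten, n = _SELECT_CLAUSE.subn("SELECT (count(*) AS ?C) ", query, count=1)
    if n != 1:
        raise ValueError("no SELECT clause found")
    return rewritten


def run_query(endpoint, query, accept="text/csv"):
    """POST a query to a SPARQL endpoint and return the raw response body."""
    data = urllib.parse.urlencode({"query": query}).encode("utf-8")
    request = urllib.request.Request(endpoint, data=data,
                                     headers={"Accept": accept})
    with urllib.request.urlopen(request) as response:
        return response.read()


def timed(run):
    """Call run() and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = run()
    return result, time.perf_counter() - start


# Usage (hypothetical endpoint):
# endpoint = "http://localhost:3030/ds/query"
# _, full_time = timed(lambda: run_query(endpoint, query))
# _, count_time = timed(lambda: run_query(endpoint, to_count_query(query)))
```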

    Andy

