Posted to users@jena.apache.org by Martin Van Aken <ma...@joyouscoding.com> on 2021/05/06 08:54:47 UTC

Jena / Fuseki / SPARQL performance (new to the tech)

Hi!
I'm Martin, a software developer new to the triples/SPARQL world. I'm
currently building queries against a Fuseki/TDB backend (that I can work on
too) and I'm running into significant performance problems (including
never-ending queries). Despite what I thought was a thorough search of the
Apache Jena website, I could not find much about performance
investigation, so I'm trying here.

Most of my data experience comes from the relational world (ex: PG) so I'm
sometimes drawing comparisons there.

To give some context, my data set has around 15 linked concepts, with the
number of triples for each ranging from a few hundred to 500K, for a total
of less than 2 million (documents/authors/publication kind of data).

Onto questions:

   - When I'm facing a slow query, what are my investigation options? Is
   there an equivalent of an "explain plan" in SQL pointing to the query's
   specific slow points? What's the advised way to do performance checks in
   SPARQL?
   - Are there any performance setups to be aware of on the server side?
   Like ways to check indexes are correctly built (outside of text search that
   I'm not working with for the moment)
   - We're currently using TDB1. I've seen the transactional benefits of
   TDB2 - are there performance improvements too that would warrant a
   migration there?

Thanks a lot already!

Martin
-- 
*Martin Van Aken - **Freelance Enthusiast Developer*

Mobile : +32 486 899 652

Follow me on Twitter : @martinvanaken <http://twitter.com/martinvanaken>
Call me on Skype : vanakenm
Hang out with me : martin@joyouscoding.com
Contact me on LinkedIn : http://www.linkedin.com/in/martinvanaken
Company website : www.joyouscoding.com

Re: Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
I'm probably misunderstanding this graph pattern or the data

> ?paper    iospress:publicationDate  ?pubDate ;
>                   iospress:publicationIncludesKeyword ?keyword;
>                   iospress:publicationAuthorList  [ ?idx ?author ] .

but just in case this author list is an RDF list, then it won't work as 
intended to get all authors with position. Or did you really model it as 
a bunch of bnodes with a separate property per index? I mean, the ?idx 
has to be a URI.

If it is a list, maybe the ARQ list extension might be useful: 
https://jena.apache.org/documentation/query/rdf_lists.html
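
As a rough sketch of what that could look like, assuming the author list really is an RDF collection (rdf:first / rdf:rest), which the thread does not confirm: the list:index property function from the linked ARQ page binds each member together with its position. The SELECT shape and the ?authors variable are only illustrative.

```sparql
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX list:     <http://jena.apache.org/ARQ/list#>

# Assumption: iospress:publicationAuthorList points at the head of an
# RDF collection. If the data uses one property per position instead,
# the original [ ?idx ?author ] pattern is the one to keep.
SELECT ?paper ?idx ?author
WHERE {
  ?paper   iospress:publicationAuthorList ?authors .
  # ARQ property function: binds ?idx to the position of ?author in the list.
  ?authors list:index (?idx ?author) .
}
ORDER BY ?paper ?idx
```

Note that list:index is an ARQ extension, so a query like this is tied to Jena rather than portable SPARQL.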


Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Martin Van Aken <ma...@joyouscoding.com>.
Hi Martynas,
Thanks a lot - that was exactly what I was wondering, as indeed a lot of my
variables are just there to make sure I have them in the output but are
probably extending the search space a lot for no good reason. I did not know
I could wrap a query in a DESCRIBE, thus splitting the "what I want to see"
part from the "what I'm searching on" part. Going to try this ASAP.

Thanks!

Martin

On Thu, 20 May 2021 at 11:12, Martynas Jusevičius <ma...@atomgraph.com>
wrote:

> Martin,
>
> Some of the OPTIONAL variables don't seem to be used anywhere else in the
> query.
>
> Rather than using SELECT to pull the data fields, can't you use it to
> only filter down the entities of interest, and wrap the whole thing
> into a DESCRIBE to retrieve their full descriptions as graphs?



Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Martynas Jusevičius <ma...@atomgraph.com>.
Martin,

Some of the OPTIONAL variables don't seem to be used anywhere else in the query.

Rather than using SELECT to pull the data fields, can't you use it to
only filter down the entities of interest, and wrap the whole thing
into a DESCRIBE to retrieve their full descriptions as graphs?
Something like:

PREFIX  rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX  iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
PREFIX  xsd:  <http://www.w3.org/2001/XMLSchema#>
PREFIX  rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX  iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX  iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

DESCRIBE *
WHERE
  { SELECT  ?paper ?author ?issueOrBook ?access ?journal
    WHERE
      {   { ?paper  rdf:type  iospress:Chapter }
        UNION
          { ?paper  rdf:type  iospress:Article }
        ?paper    iospress:publicationDate  ?pubDate ;
                 iospress:publicationIncludesKeyword ?keyword;
                 iospress:publicationAuthorList  [ ?idx ?author ] .
        ?issueOrBook  iospress:partOf   ?volumeOrSerie .
        ?paper    iospress:partOf       ?issueOrBook
        OPTIONAL
          { ?paper  iospress:publicationAccessibility  ?access }
        OPTIONAL
          { ?volumeOrSerie
                      iospress:partOf  ?journal
          }
        FILTER ( ( ( datatype(?pubDate) = xsd:date
                     && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime
                     && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime )
                   || ( datatype(?pubDate) = xsd:gYear
                        && ?pubDate >= "2000"^^xsd:gYear
                        && ?pubDate <= "2021"^^xsd:gYear ) )
                 && regex(?keyword, "sickness", "i") )
      }
    ORDER BY ?pubDate ?paper
    LIMIT   50
  }
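
If the tabular SELECT output is needed rather than graphs, the same split can be sketched with a subquery: the inner SELECT uses only the patterns needed for filtering and applies the LIMIT, and the OPTIONAL "display" patterns then run for at most 50 papers. This is a minimal sketch under assumptions, not a tested rewrite: the property names are taken from the query earlier in the thread, and the choice of which fields to decorate with is only illustrative.

```sparql
PREFIX rdf:      <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX rdfs:     <http://www.w3.org/2000/01/rdf-schema#>
PREFIX xsd:      <http://www.w3.org/2001/XMLSchema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT ?paper ?pubDate ?title ?access
       (GROUP_CONCAT(DISTINCT ?authorName; separator=", ") AS ?authors)
WHERE {
  # Step 1: find the papers of interest using only the search criteria.
  { SELECT DISTINCT ?paper ?pubDate
    WHERE {
        { ?paper rdf:type iospress:Chapter }
      UNION
        { ?paper rdf:type iospress:Article }
      ?paper iospress:publicationDate ?pubDate ;
             iospress:publicationIncludesKeyword ?keyword .
      FILTER ( regex(?keyword, "sickness", "i") )
      FILTER ( ( datatype(?pubDate) = xsd:date
                 && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime
                 && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime )
            || ( datatype(?pubDate) = xsd:gYear
                 && ?pubDate >= "2000"^^xsd:gYear
                 && ?pubDate <= "2021"^^xsd:gYear ) )
    }
    ORDER BY ?pubDate ?paper
    LIMIT 50
  }
  # Step 2: fetch display fields, only for the papers selected above.
  OPTIONAL { ?paper rdfs:label ?title }
  OPTIONAL { ?paper iospress:publicationAccessibility ?access }
  OPTIONAL {
    ?paper  iospress:publicationAuthorList [ ?idx ?author ] .
    ?author rdfs:label ?authorName
  }
}
GROUP BY ?paper ?pubDate ?title ?access
ORDER BY ?pubDate ?paper
```

Every OPTIONAL in step 2 starts from ?paper, which is always bound by the subquery, so none of them can turn into a scan unrelated to the selected papers.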

On Thu, May 20, 2021 at 10:44 AM Martin Van Aken
<ma...@joyouscoding.com> wrote:
>
> Andy,
> A big thanks for this - it gives me some paths to explore. I think indeed
> my biggest problems are in the optional parts - I'll run the test you
> advised and also look in which case I may be able to get rid of the
> optionals to avoid those situations that could lead to a large number of
> results, as you mentioned. I'm already looking at getting my filters closer
> to the definition - can this be done for things other than pure equality (for
> example for the dates, which are tested against a range)?
>
> Maybe one question about optional - I use them in some cases to avoid empty
> results. An example is Access - some papers have an Access triple (Open or
> Closed) - but some have none. My understanding is that if I make a link
> without optional like:
>
> ?paper iospress:accessibility ?access
>
> this will de facto remove all papers without access from the set. This is
> something I don't want (I want them in the list, just with an empty value
> there) - and my understanding is that the way to manage this is an
> Optional. Is this correct? Is there a "better" way? If this ends up being
> costly, I could also check to actually have a value for those (those
> without value are technically "Closed").
>
> Something I was wondering also is whether it makes sense to split the
> fields I need for search/filtering vs the ones I want to see on the result.
> I've a feeling that in theory I could play with two queries - one with only
> the params I need for the filtering, then play something similar to
> DESCRIBE on each record on the filtered set - but I've no idea if this
> would be more performant than keeping it together as it is now.
>
> Anyway, the exchanges here are much appreciated!
>
> On Tue, 18 May 2021 at 19:18, Andy Seaborne <an...@apache.org> wrote:
>
> > Martin,
> >
> > That's a complicated query and I haven't got my head around it completely
> > yet.
> >
> > There are some useful points to understand:
> >
> > A::
> >
> > What is the time and outcome of these queries that focus on the main
> > data location part:
> >
> > 1/
> >
> > SELECT (count(*) AS ?C) {
> >   ?paper  iospress:publicationDate ?pubDate
> >   FILTER(...date test...)
> > }
> >
> > 2/
> > SELECT (count(*) AS ?C) {
> >   ?paper  iospress:publicationDate ?pubDate ;
> >           iospress:publicationIncludesKeyword ?keyword .
> >   FILTER (...date test... && regex(?keyword, "sickness", "i"))
> > }
> >
> > 3/
> > SELECT (count(*) AS ?C) {
> >    {?paper rdf:type iospress:Chapter.}
> >              union
> >    {?paper rdf:type iospress:Article.}
> >    ?paper  iospress:publicationDate ?pubDate
> >    FILTER(...date test...)
> > }
> >
> > 4/
> > SELECT (count(*) AS ?C) {
> >   ?paper  iospress:publicationDate ?pubDate
> >   FILTER(.. date test...)
> >    {?paper rdf:type iospress:Chapter.}
> >              union
> >    {?paper rdf:type iospress:Article.}
> > }
> >
> > B::
> >
> > then is it the case that some optionals have more effect than others?
> > Some are "high risk":
> >
> > ---
> >      OPTIONAL {
> >          ?author iospress:contributorAffiliation ?affiliation.
> >          ?affiliation rdfs:label ?university;
> >      }
> >       OPTIONAL {
> >        ?affiliation iospress:geocodingOutput ?geocoded.
> >        ?geocoded iospress-geocode:country ?country
> >      }
> > ---
> > Suppose the first does not match: ?affiliation is then unbound, so the
> > second OPTIONAL matches every affiliation in the data, which is a lot of
> > results unrelated to ?paper.
> >
> > C::
> >
> > distinct
> >
> > it might be worth trying without distinct because distinct can cause a
> > lot of results to be reduced to just a few, hiding redundant work.
> >
> >      Andy
> >
> > On 18/05/2021 13:31, Martin Van Aken wrote:
> > > Hello again,
> > > After some more days of me trying to get better performance & the
> > > approval of my company, here is what we try to run (query at the bottom of
> > > the mail).
> > >
> > > For some context:
> > >
> > > - This is a search for academic papers. Papers have multiple authors, and
> > > authors are part of multiple universities. Papers also have multiple
> > > keywords and are generally part of a set (an issue), itself part of a set
> > > (a volume), itself part of a set (a journal).
> > > - Our goal is to have a multicriteria search front end, so the query is
> > > generated from a form with clauses selected by the user. The structure is
> > > always the same; this example uses a single condition on the "keyword"
> > > - The set of data is relatively small - around 150k papers (so probably
> > > 1M triples there), probably around 500k authors
> > > - We use group/concat as we want to give as results one line per paper
> > > (vs having one per paper per keyword for example)
> > > - I've read OPTIONALs are pretty bad - but I've no real alternative here
> > > that I know of when some fields can be present or not and I don't want
> > > to throw away all papers that miss at least one
> > >
> > > For our current results, all but the most precise queries (getting into a
> > > super limited set of papers, like <10) get extremely slow (easily
> > > dozens of seconds, sometimes more). I feel that there is something obvious
> > > that I'm missing, either in the query or my Jena config. The server is on an
> > > old version but I make my tests locally on a 4.0.0 "out of the box" (0
> > > configuration).
> > >
> > > What I've tried:
> > >
> > > - Removing the ORDER does not impact much
> > > - Removing most optionals works... but removes the point of the query
> > > - Using contains instead of regex does not impact much (my goal is to
> > > use the Jena/Lucene integration for everything text related)
> > >
> > > I'm really in for an opinion, as with my RDBMS background this is the
> > > equivalent of less than 3M records split over around 8 tables - something
> > > that should be queryable mostly in sub-second times.
> > >
> > > Any feedback is most welcome !
> > >
> > > Martin
> > >
> > > PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> > >      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> > >      PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> > >      PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
> > >      PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> > >      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> > >
> > >      SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
> > >          (group_concat(distinct ?authorName;separator=", ") as ?Authors)
> > >          (group_concat(distinct ?keyword;separator=", ") as ?keywords)
> > >          (group_concat(distinct ?university;separator=", ") as ?universities)
> > >          (group_concat(distinct ?country;separator=", ") as ?countries)
> > >      WHERE {
> > >          {?paper rdf:type iospress:Chapter.}
> > >              union
> > >          {?paper rdf:type iospress:Article.}
> > >
> > >          ?paper rdfs:label ?title;
> > >                   rdf:type ?type;
> > >
> > >                   iospress:publicationDate ?pubDate;
> > >                   iospress:publicationAbstract ?abstract;
> > >
> > >                   iospress:publicationIncludesKeyword ?keyword;
> > >                   iospress:publicationAuthorList [?idx ?author].
> > >
> > >          ?issueOrBook iospress:partOf ?volumeOrSerie.
> > >          ?paper iospress:partOf ?issueOrBook.
> > >
> > >
> > >      OPTIONAL {
> > >          ?issueOrBook iospress:isbn ?bookIsbn.
> > >      }
> > >      OPTIONAL {
> > >          ?paper iospress:publicationDoiUrl ?doi.
> > >      }
> > >      OPTIONAL {
> > >          ?author rdfs:label ?authorName.
> > >      }
> > >      OPTIONAL {
> > >          ?author iospress:contributorAffiliation ?affiliation.
> > >          ?affiliation rdfs:label ?university;
> > >      }
> > >       OPTIONAL {
> > >        ?affiliation iospress:geocodingOutput ?geocoded.
> > >        ?geocoded iospress-geocode:country ?country
> > >      }
> > >      OPTIONAL {
> > >          ?paper iospress:publicationAccessibility ?access.
> > >      }
> > >      OPTIONAL {
> > >          ?volumeOrSerie iospress:partOf ?journal;
> > >      }
> > >      FILTER(
> > >          (
> > >              (datatype(?pubDate) = xsd:date
> > >                  && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime
> > >                  && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime) ||
> > >              (datatype(?pubDate) = xsd:gYear
> > >                  && ?pubDate >= "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
> > >          )
> > >
> > >          && (regex (?keyword, "sickness", "i"))
> > >          )
> > >      }
> > >      GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
> > >
> > >      ORDER BY ?pubDate ?paper
> > >      LIMIT 50
> > >
> > >
> > > On Thu, 6 May 2021 at 20:10, Andy Seaborne <an...@apache.org> wrote:
> > >
> > >> Hi there,
> > >>
> > >> Showing the query would be helpful but some general remarks:
> > >>
> > >> 1/ If the query or the setup for Fuseki needs more than the default
> > >> heap size, then it might be that the JVM is getting into a state of
> > >> heap exhaustion. This manifests as the CPU load getting very high. It
> > >> will seem like nothing is happening (waiting for response).
> > >>
> > >> 2/ The query may be expensive.
> > >>
> > >> Things to look for:
> > >> * cross products - two parts of the query pattern that are not
> > >> connected.
> > >>
> > >> { ?s ?p ?o . ?a ?b ?c } produces N-squared results, where N is the size
> > >> of the database.
> > >>
> > >> * sort, spilling to disk or combined with a cross product in the query.
> > >>
> > >> 3/ If no results are coming back, then the query is of a form that does not
> > >> stream back - sort, or CONSTRUCT maybe.
> > >>
> > >> There was a useful presentation recently that talks about the principles
> > >> of query efficiency.
> > >>
> > >> SPARQL Query Optimization with Pavel Klinov
> > >> https://www.youtube.com/watch?v=16eMswT2x2Y
> > >>
> > >> More inline:
> > >>
> > >> On 06/05/2021 09:54, Martin Van Aken wrote:
> > >>> Hi!
> > >>> I'm Martin, I'm a software developer new to the Triples/SPARQL world. I'm
> > >>> currently building queries against a Fuseki/TDB backend (that I can work on
> > >>> too) and I'm getting into significant performance problems (including never
> > >>> ending queries).
> > >>
> > >> Are updates also happening at the same time?
> > >>
> > >>> Despite what I thought was a good search on the apache
> > >>> jena website I could not find a lot of insight about performance
> > >>> investigation so I'm trying it here.
> > >>>
> > >>> Most of my data experience comes from the relational world (ex: PG) so I'm
> > >>> sometimes drawing comparisons there.
> > >>>
> > >>> To give some context my data set is around 15 linked concepts, with the
> > >>> number of triples for each ranging from some hundreds to 500K - total less
> > >>> than 2 millions (documents/authors/publication kind of data).
> > >>>
> > >>> Onto questions:
> > >>>
> > >>>      - When I'm facing a slow query, what are my investigation options. Is
> > >>>      there an equivalent of an "explain plan" in SQL pointing to the query
> > >>>      specific slow points? What's the advised way for performance checks in
> > >>>      SPARQL?
> > >>
> > >> qparse --print=opt --file query.rq
> > >>
> > >>>      - Are there any performance setups to be aware of on the server side?
> > >>>      Like ways to check indexes are correctly built (outside of text search that
> > >>>      I'm not working with for the moment)
> > >>>      - We're currently using TDB1. I've seen the transactional benefits of
> > >>>      TDB2 - are there performance improvements too that would warrant a
> > >>>      migration there ?
> > >>
> > >> Not on the query side.
> > >>
> > >>       Andy
> > >>
> > >>>
> > >>> Thanks a lot already!
> > >>>
> > >>> Martin
> > >>>
> > >>
> > >
> > >
> >
>
>

Re: Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
Wouldn't you use a nested OPTIONAL here anyway?
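
A minimal sketch of what such nesting could look like, applied to the two affiliation OPTIONALs quoted below. It reuses the prefixes and property names from Martin's query and is untested against the actual data. The geocoding pattern moves inside the affiliation OPTIONAL, so it is only evaluated when ?affiliation is bound:

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

SELECT ?author ?university ?country
WHERE {
    ?paper iospress:publicationAuthorList [ ?idx ?author ] .

    # Outer OPTIONAL: binds ?affiliation only for authors that have one.
    OPTIONAL {
        ?author iospress:contributorAffiliation ?affiliation .
        ?affiliation rdfs:label ?university .

        # Inner OPTIONAL: only evaluated when ?affiliation is bound,
        # so it can no longer match geocoding data across the whole dataset.
        OPTIONAL {
            ?affiliation iospress:geocodingOutput ?geocoded .
            ?geocoded iospress-geocode:country ?country .
        }
    }
}
```

Note that this slightly changes the meaning: an affiliation without an rdfs:label no longer contributes a country, which may or may not matter for this data.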
> If it is just one triple in the OPTIONAL, it is less likely to be bad, but
> if the query later uses a variable that was left unbound, there will be a very
> large number of results, many of them duplicates and not actually related to
> the ?paper. I am guessing, but I would not be surprised if your query has
> variants of this that are hidden by the "distinct".
>
> This is the problem at:
>
> >> ---
> >>       OPTIONAL {
> >>           ?author iospress:contributorAffiliation ?affiliation.
> >>           ?affiliation rdfs:label ?university;
> >>       }
> >>        OPTIONAL {
> >>         ?affiliation iospress:geocodingOutput ?geocoded.
> >>         ?geocoded iospress-geocode:country ?country
> >>       }
> >> ---
>
> If there is no ?affiliation, then the second OPTIONAL runs over the whole
> database, which I'm guessing gives many results.
>
>     Andy
>
>> this will de facto remove all papers without access from the set. This is
>> something I don't want (I want them in the list, just with an empty value
>> there) - and my understanding is that the way to manage this is an
>> OPTIONAL. Is this correct? Is there a "better" way? If this ends up being
>> costly, I could also make sure there is always a value for those (those
>> without a value are technically "Closed").
>>
>> Something I was also wondering is whether it makes sense to split the
>> fields I need for search/filtering from the ones I want to see in the
>> result. I have a feeling that in theory I could use two queries - one with
>> only the params I need for the filtering, then something similar to
>> DESCRIBE on each record of the filtered set - but I've no idea if this
>> would be more performant than keeping it together as it is now.
>>
>> Anyway, the exchanges here are much appreciated!
>>
>> On Tue, 18 May 2021 at 19:18, Andy Seaborne <an...@apache.org> wrote:
>>
>>> Martin,
>>>
>>> That's a complicated query and I haven't got my head around it completely
>>> yet.
>>>
>>> There are some useful points to understand:
>>>
>>> A::
>>>
>>> What is the time and outcome of these queries that focus on the main
>>> data location part:
>>>
>>> 1/
>>>
>>> SELECT (count(*) AS ?C) {
>>>    ?paper  iospress:publicationDate ?pubDate
>>>    FILTER(...date test...)
>>> }
>>>
>>> 2/
>>> SELECT (count(*) AS ?C) {
>>>    ?paper  iospress:publicationDate ?pubDate ;
>>>            iospress:publicationIncludesKeyword ?keyword .
>>>    FILTER(...date test... && regex(?keyword, "sickness", "i"))
>>> }
>>>
>>> 3/
>>> SELECT (count(*) AS ?C) {
>>>     {?paper rdf:type iospress:Chapter.}
>>>               union
>>>     {?paper rdf:type iospress:Article.}
>>>     ?paper  iospress:publicationDate ?pubDate
>>>     FILTER(...date test...)
>>> }
>>>
>>> 4/
>>> SELECT (count(*) AS ?C) {
>>>    ?paper  iospress:publicationDate ?pubDate
>>>    FILTER(...date test...)
>>>     {?paper rdf:type iospress:Chapter.}
>>>               union
>>>     {?paper rdf:type iospress:Article.}
>>> }
>>>
>>> B::
>>>
>>> Then, is it the case that some OPTIONALs have more effect than others?
>>> Some are "high risk":
>>>
>>> ---
>>>       OPTIONAL {
>>>           ?author iospress:contributorAffiliation ?affiliation.
>>>           ?affiliation rdfs:label ?university;
>>>       }
>>>        OPTIONAL {
>>>         ?affiliation iospress:geocodingOutput ?geocoded.
>>>         ?geocoded iospress-geocode:country ?country
>>>       }
>>> ---
>>> Suppose the first does not match: then the second produces a lot of results
>>> unrelated to ?paper.
>>>
>>> C::
>>>
>>> DISTINCT
>>>
>>> It might be worth trying without DISTINCT, because DISTINCT can reduce a
>>> lot of results to just a few, hiding redundant work.
>>>
>>>       Andy
>>>
>>> On 18/05/2021 13:31, Martin Van Aken wrote:
>>>> Hello again,
>>>> After some more days of trying to get better performance, and with the
>>>> approval of my company, here is what we are trying to run (query at the
>>>> bottom of the mail).
>>>>
>>>> For some context:
>>>>
>>>> - This is a search for academic papers. Papers have multiple authors, and
>>>> authors are part of multiple universities. Papers also have multiple
>>>> keywords and are generally part of a set (an issue), itself part of a set
>>>> (a volume), itself part of a set (a journal).
>>>> - Our goal is to have a multicriteria search front end, so the query is
>>>> generated from a form with clauses selected by the user. The structure is
>>>> always the same; this example uses a single condition on the "keyword".
>>>> - The set of data is relatively small - around 150k papers (so probably
>>>> 1M triples there), probably around 500k authors.
>>>> - We use group/concat as we want to return one line per paper (vs
>>>> having one per paper per keyword, for example).
>>>> - I've read that OPTIONALs are pretty bad - but I have no real alternative
>>>> here that I know of when some fields can be present or not and I don't
>>>> want to throw away every paper that misses at least one.
>>>>
>>>> For our current results, all but the most precise queries (narrowing down
>>>> to a very limited set of papers, like <10) get extremely slow (easily
>>>> dozens of seconds, sometimes more). I feel that there is something obvious
>>>> that I'm missing, either in the query or in my Jena config. The server is
>>>> on an old version, but I run my tests locally on a 4.0.0 "out of the box"
>>>> (zero configuration).
>>>>
>>>> What I've tried:
>>>>
>>>> - Removing the ORDER BY does not have much impact
>>>> - Removing most OPTIONALs works... but removes the point of the query
>>>> - Using contains instead of regex does not have much impact (my goal is
>>>> to use the Jena/Lucene integration for everything text related)
>>>>
>>>> I'd really welcome an opinion: from my RDBMS background, this is the
>>>> equivalent of less than 3M records split across around 8 tables -
>>>> something that should be queryable mostly in sub-second times.
>>>>
>>>> Any feedback is most welcome !
>>>>
>>>> Martin
>>>>
>>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>>       PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>>       PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
>>>>       PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
>>>>       PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
>>>>       PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>>>
>>>>       SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
>>>>           (group_concat(distinct ?authorName;separator=", ") as ?Authors)
>>>>           (group_concat(distinct ?keyword;separator=", ") as ?keywords)
>>>>           (group_concat(distinct ?university;separator=", ") as ?universities)
>>>>           (group_concat(distinct ?country;separator=", ") as ?countries)
>>>>       WHERE {
>>>>           {?paper rdf:type iospress:Chapter.}
>>>>               union
>>>>           {?paper rdf:type iospress:Article.}
>>>>
>>>>           ?paper rdfs:label ?title;
>>>>                    rdf:type ?type;
>>>>
>>>>                    iospress:publicationDate ?pubDate;
>>>>                    iospress:publicationAbstract ?abstract;
>>>>
>>>>                    iospress:publicationIncludesKeyword ?keyword;
>>>>                    iospress:publicationAuthorList [?idx ?author].
>>>>
>>>>           ?issueOrBook iospress:partOf ?volumeOrSerie.
>>>>           ?paper iospress:partOf ?issueOrBook.
>>>>
>>>>
>>>>       OPTIONAL {
>>>>           ?issueOrBook iospress:isbn ?bookIsbn.
>>>>       }
>>>>       OPTIONAL {
>>>>           ?paper iospress:publicationDoiUrl ?doi.
>>>>       }
>>>>       OPTIONAL {
>>>>           ?author rdfs:label ?authorName.
>>>>       }
>>>>       OPTIONAL {
>>>>           ?author iospress:contributorAffiliation ?affiliation.
>>>>           ?affiliation rdfs:label ?university;
>>>>       }
>>>>        OPTIONAL {
>>>>         ?affiliation iospress:geocodingOutput ?geocoded.
>>>>         ?geocoded iospress-geocode:country ?country
>>>>       }
>>>>       OPTIONAL {
>>>>           ?paper iospress:publicationAccessibility ?access.
>>>>       }
>>>>       OPTIONAL {
>>>>           ?volumeOrSerie iospress:partOf ?journal;
>>>>       }
>>>>       FILTER(
>>>>           (
>>>>               (datatype(?pubDate) = xsd:date &&
>>>>                xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime &&
>>>>                xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
>>>>               (datatype(?pubDate) = xsd:gYear &&
>>>>                ?pubDate >= "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
>>>>           )
>>>>
>>>>           && (regex (?keyword, "sickness", "i"))
>>>>           )
>>>>       }
>>>>       GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
>>>>
>>>>       ORDER BY ?pubDate ?paper
>>>>       LIMIT 50
>>>>
>>>>
>>>> On Thu, 6 May 2021 at 20:10, Andy Seaborne <an...@apache.org> wrote:
>>>>
>>>>> Hi there,
>>>>>
>>>>> Showing the query would be helpful but some general remarks:
>>>>>
>>>>> 1/ If the query or the Fuseki setup needs more than the default heap
>>>>> size, then the JVM may be getting into a state of heap exhaustion. This
>>>>> manifests as the CPU load getting very high. It will seem like nothing
>>>>> is happening (waiting for a response).
>>>>>
>>>>> 2/ The query may be expensive.
>>>>>
>>>>> Things to look for
>>>>> * cross products - two parts of the query pattern that are not
>>>>> connected.
>>>>>
>>>>> { ?s ?p ?o . ?a ?b ?c } is N-squared in the size of the database.
>>>>>
>>>>> * sort - spilling to disk, or combined with a cross product in the query.
>>>>>
>>>>> 3/ If no results are coming back, then the query is of a form that does
>>>>> not stream results back - a sort, or maybe CONSTRUCT.
>>>>>
>>>>> There was a useful presentation recently that talks about the principles
>>>>> of query efficiency.
>>>>>
>>>>> SPARQL Query Optimization with Pavel Klinov
>>>>> https://www.youtube.com/watch?v=16eMswT2x2Y
>>>>>
>>>>> More inline:
>>>>>
>>>>> On 06/05/2021 09:54, Martin Van Aken wrote:
>>>>>> Hi!
>>>>>> I'm Martin, a software developer new to the Triples/SPARQL world. I'm
>>>>>> currently building queries against a Fuseki/TDB backend (that I can work
>>>>>> on too) and I'm running into significant performance problems (including
>>>>>> never-ending queries).
>>>>>
>>>>> Are updates also happening at the same time?
>>>>>
>>>>>> Despite what I thought was a good search on the apache
>>>>>> jena website I could not find a lot of insight about performance
>>>>>> investigation so I'm trying it here.
>>>>>>
>>>>>> Most of my data experience comes from the relational world (e.g. PG), so
>>>>>> I'm sometimes drawing comparisons there.
>>>>>>
>>>>>> To give some context, my data set is around 15 linked concepts, with the
>>>>>> number of triples for each ranging from a few hundred to 500K - in total
>>>>>> less than 2 million (documents/authors/publication kind of data).
>>>>>>
>>>>>> On to the questions:
>>>>>>
>>>>>>       - When I'm facing a slow query, what are my investigation options? Is
>>>>>>       there an equivalent of SQL's "explain plan" that points to the
>>>>>>       query's specific slow points? What's the advised way to do
>>>>>>       performance checks in SPARQL?
>>>>>
>>>>> qparse --print=opt --file query.rq
>>>>>
>>>>>>       - Are there any performance setups to be aware of on the server
>>>>>>       side? Like ways to check that indexes are correctly built (outside of
>>>>>>       text search, which I'm not working with for the moment)
>>>>>>       - We're currently using TDB1. I've seen the transactional benefits
>>>>>>       of TDB2 - are there performance improvements too that would warrant
>>>>>>       a migration there?
>>>>>
>>>>> Not on the query side.
>>>>>
>>>>>        Andy
>>>>>
>>>>>>
>>>>>> Thanks a lot already!
>>>>>>
>>>>>> Martin
>>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>>

Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Martin Van Aken <ma...@joyouscoding.com>.
Great, thanks for the fast and precise answers, will try this.

Martin

On Tue, 25 May 2021 at 09:25, Martynas Jusevičius <ma...@atomgraph.com>
wrote:

> You get Turtle data because CONSTRUCT/DESCRIBE forms return a graph
> while ASK/SELECT return a tabular result set.
>
> You can try Accept: application/ld+json request header in order to get
> JSON-LD data: https://www.w3.org/TR/json-ld11/
>
> If you need a connected node in the description, you'll need to add a
> pattern that leads to it and then add the variable to DESCRIBE.
>
> On Tue, May 25, 2021 at 9:11 AM Martin Van Aken <ma...@joyouscoding.com>
> wrote:
> >
> > Hello again,
> > Thanks Steve & Lorenz - I'll have a look at nested optionals (did not
> > realize that was a thing).
> >
> > I've made tests with DESCRIBE and this seems to be the way to go - I got
> > the major performance improvement I needed (around 10x). This leaves me
> > with two more questions:
> >
> > - It seems that DESCRIBE always returns some kind of TTL format - is there
> > a hidden way to get JSON (like for a SELECT query) or is this by design?
> > It's not blocking but would mean some parsing of the results
> > - It seems DESCRIBE (in Jena, as I understand this is implementation
> > dependent) is limited to the object itself (i.e. all objects linked to a
> > specific subject). This works for most of my needs, but I have some related
> > data I want to get too - what's the way there? Make a secondary query to
> > get those (e.g. I'll get papers back, but papers are linked to authors who
> > work in universities and I'd need those too)? If I do so and want to
> > avoid a "SELECT N+1" kind of problem (sending a secondary query per record),
> > is there some kind of "WHERE ?paper IN (..., ..., ...)" or do I just play
> > with OR clauses?
> >
> > Thanks again, this ML is having a huge impact on my knowledge & the linked
> > data project I'm working on, this is much appreciated.
> >
> > Martin
> >
> > On Thu, 20 May 2021 at 15:34, Steve Vestal <steve.vestal@adventiumlabs.com> wrote:
> >
> > > Andy pointed at sequential OPTIONALs.  One example I have seen had
> > > nested OPTIONAL clauses to address a performance issue.  Might that be
> > > helpful here?
> > >
> > > On 5/20/2021 5:43 AM, Andy Seaborne wrote:
> > > >
> > > >
> > > > On 20/05/2021 09:36, Martin Van Aken wrote:
> > > >> Andy,
> > > >> A big thanks for this - it gives me some paths to explore. I think my
> > > >> biggest problems are indeed in the optional parts - I'll run the tests
> > > >> you advised and also look at where I may be able to get rid of the
> > > >> OPTIONALs, to avoid the situations that could lead to a large number of
> > > >> results as you mentioned. I'm already looking at moving my filters closer
> > > >> to the definition - can this be done for things other than pure equality
> > > >> (for example for the dates, which are tested against a range)?
> > > >>
> > > >> Maybe one question about OPTIONAL - I use them in some cases to avoid
> > > >> empty results. An example is Access - some papers have an Access triple
> > > >> (Open or Closed) - but some have none. My understanding is that if I make
> > > >> a link without OPTIONAL like:
> > > >>
> > > >> ?paper iospress:accessibility ?access
> > > >
> > > > If it is just one triple in the OPTIONAL, it is less likely to be bad, but
> > > > if the query later uses a variable that was left unbound, there will be a
> > > > very large number of results, many of them duplicates and not actually
> > > > related to the ?paper. I am guessing, but I would not be surprised if your
> > > > query has variants of this that are hidden by the "distinct".
> > > >
> > > > This is the problem at:
> > > >
> > > > >> ---
> > > > >>       OPTIONAL {
> > > > >>           ?author iospress:contributorAffiliation ?affiliation.
> > > > >>           ?affiliation rdfs:label ?university;
> > > > >>       }
> > > > >>        OPTIONAL {
> > > > >>         ?affiliation iospress:geocodingOutput ?geocoded.
> > > > >>         ?geocoded iospress-geocode:country ?country
> > > > >>       }
> > > > >> ---
> > > >
> > > > If there is no ?affiliation, then the second OPTIONAL runs over the whole
> > > > database, which I'm guessing gives many results.
> > > >
> > > >     Andy
> > > >
> > > >> this will de facto remove all papers without access from the set. This is
> > > >> something I don't want (I want them in the list, just with an empty value
> > > >> there) - and my understanding is that the way to manage this is an
> > > >> OPTIONAL. Is this correct? Is there a "better" way? If this ends up being
> > > >> costly, I could also make sure there is always a value for those (those
> > > >> without a value are technically "Closed").
> > > >>
> > > >> Something I was also wondering is whether it makes sense to split the
> > > >> fields I need for search/filtering from the ones I want to see in the
> > > >> result. I have a feeling that in theory I could use two queries - one with
> > > >> only the params I need for the filtering, then something similar to
> > > >> DESCRIBE on each record of the filtered set - but I've no idea if this
> > > >> would be more performant than keeping it together as it is now.
> > > >>
> > > >> Anyway, the exchanges here are much appreciated!
> > > >>
>



Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Martynas Jusevičius <ma...@atomgraph.com>.
You get Turtle data because CONSTRUCT/DESCRIBE forms return a graph
while ASK/SELECT return a tabular result set.

You can try sending an Accept: application/ld+json request header in order
to get JSON-LD data: https://www.w3.org/TR/json-ld11/

If you need a connected node in the description, you'll need to add a
pattern that leads to it and then add the variable to DESCRIBE.
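
A rough sketch of both points, for illustration only. The paper IRIs in the VALUES block are made-up placeholders, and the property names are taken from Martin's earlier query. VALUES is the standard SPARQL 1.1 way to supply a fixed list of resources (FILTER(?paper IN (...)) also exists), which avoids sending one query per record. Listing ?author and ?affiliation after DESCRIBE asks for those connected nodes to be described as well:

```sparql
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

DESCRIBE ?paper ?author ?affiliation
WHERE {
    # Hypothetical paper IRIs, e.g. collected from a first, filtering-only query.
    VALUES ?paper {
        <http://example.org/paper/1>
        <http://example.org/paper/2>
    }

    # Pattern leading from each paper to its authors.
    ?paper iospress:publicationAuthorList [ ?idx ?author ] .

    # Optional, so that papers whose authors have no affiliation are still returned.
    OPTIONAL { ?author iospress:contributorAffiliation ?affiliation . }
}
```

What exactly ends up in each description remains implementation dependent, so the result is worth checking against the data you expect.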

On Tue, May 25, 2021 at 9:11 AM Martin Van Aken <ma...@joyouscoding.com> wrote:
>
> Hello again,
> Thanks Steve & Lorenz - I'll have a look at nested optionals (I did not
> realize that was a thing).
>
> I've run tests with DESCRIBE and this seems to be the way to go - I got the
> major performance improvement I needed (about 10x). This leaves me with two
> more questions:
>
> - It seems that DESCRIBE always returns some kind of Turtle (TTL) format - is
> there a hidden way to get JSON (like for a SELECT query) or is this by
> design? It's not blocking but would mean some parsing of the results.
> - It seems DESCRIBE (in Jena; as I understand it, this is implementation
> dependent) is limited to the object itself (i.e. all objects linked to a
> specific subject). This works for most of my needs, but I have some related
> data I want to get too - what's the way there? Make a secondary query to
> get those (e.g. I'll get papers back, but papers are linked to authors who
> work in universities and I'd need those too)? If I do so and want to
> avoid a "SELECT N+1" kind of problem (sending a secondary query per record),
> is there some kind of "WHERE ?paper IN (..., ..., ...)" or do I just play
> with OR clauses?
>
> Thanks again, this ML is having a huge impact on my knowledge & the linked
> data project I'm working on, this is much appreciated.
>
> Martin

Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Martin Van Aken <ma...@joyouscoding.com>.
Hello again,
Thanks Steve & Lorenz - I'll have a look at nested optionals (I did not
realize that was a thing).

I've run tests with DESCRIBE and this seems to be the way to go - I got the
major performance improvement I needed (about 10x). This leaves me with two
more questions:

- It seems that DESCRIBE always returns some kind of Turtle (TTL) format - is
there a hidden way to get JSON (like for a SELECT query) or is this by
design? It's not blocking but would mean some parsing of the results.
- It seems DESCRIBE (in Jena; as I understand it, this is implementation
dependent) is limited to the object itself (i.e. all objects linked to a
specific subject). This works for most of my needs, but I have some related
data I want to get too - what's the way there? Make a secondary query to
get those (e.g. I'll get papers back, but papers are linked to authors who
work in universities and I'd need those too)? If I do so and want to
avoid a "SELECT N+1" kind of problem (sending a secondary query per record),
is there some kind of "WHERE ?paper IN (..., ..., ...)" or do I just play
with OR clauses?

Thanks again, this ML is having a huge impact on my knowledge & the linked
data project I'm working on, this is much appreciated.

Martin

On Thu, 20 May 2021 at 15:34, Steve Vestal <st...@adventiumlabs.com>
wrote:

> Andy pointed at sequential OPTIONALs.  One example I have seen had
> nested OPTIONAL clauses to address a performance issue.  Might that be
> helpful here?
>
> On 5/20/2021 5:43 AM, Andy Seaborne wrote:
> >
> >
> > On 20/05/2021 09:36, Martin Van Aken wrote:
> >> Andy,
> >> A big thanks for this - it gives me some paths to explore. I think my
> >> biggest problems are indeed in the optional parts - I'll run the tests you
> >> advised and also look at the cases where I may be able to get rid of the
> >> optionals, to avoid the situations that could lead to a large number of
> >> results as you mentioned. I'm already looking at moving my filters closer
> >> to where the variables are defined - can this be done for things other
> >> than pure equality (for example, for the dates that are tested against a
> >> range)?
> >>
> >> Maybe one question about optionals - I use them in some cases to avoid
> >> empty results. An example is Access - some papers have an Access triple
> >> (Open or Closed) - but some have none. My understanding is that if I make
> >> a link without optional like:
> >>
> >> ?paper iospress:accessibility ?access
> >
> > If there is just one triple in the OPTIONAL, it is less likely to be bad,
> > but if the query uses the unbound variable later on, there will be a very
> > large number of results, many of them duplicates and not actually related
> > to the ?paper. I am guessing, but I would not be surprised if your query
> > has variants of this and it is hidden by the "distinct".
> >
> > This is the problem at:
> >
> > >> ---
> > >>       OPTIONAL {
> > >>           ?author iospress:contributorAffiliation ?affiliation.
> > >>           ?affiliation rdfs:label ?university;
> > >>       }
> > >>        OPTIONAL {
> > >>         ?affiliation iospress:geocodingOutput ?geocoded.
> > >>         ?geocoded iospress-geocode:country ?country
> > >>       }
> > >> ---
> >
> > If ?affiliation is unbound, then the second OPTIONAL matches over the
> > whole database, which I'm guessing is many results.
> >
> >     Andy
> >
> >> this will de facto remove all papers without access from the set. This is
> >> something I don't want (I want them in the list, just with an empty value
> >> there) - and my understanding is that the way to manage this is an
> >> OPTIONAL. Is this correct? Is there a "better" way? If this ends up being
> >> costly, I could also make sure every paper actually has a value (those
> >> without a value are technically "Closed").
> >>
> >> Something I was wondering also is whether it makes sense to split the
> >> fields I need for search/filtering vs the ones I want to see in the result.
> >> I've a feeling that in theory I could play with two queries - one with only
> >> the params I need for the filtering, then run something similar to
> >> DESCRIBE on each record of the filtered set - but I've no idea if this
> >> would be more performant than keeping it together as it is now.
> >>
> >> Anyway, the exchanges here are much appreciated!
> >>
> >> On Tue, 18 May 2021 at 19:18, Andy Seaborne <an...@apache.org> wrote:
> >>
> >>> Martin,
> >>>
> >>> That's a complicated query and I haven't got my head around it
> >>> completely yet.
> >>>
> >>> There are some useful points to understand:
> >>>
> >>> A::
> >>>
> >>> What is the time and outcome of these queries that focus on the main
> >>> data location part:
> >>>
> >>> 1/
> >>>
> >>> SELECT (count(*) AS ?C) {
> >>>    ?paper  iospress:publicationDate ?pubDate
> >>>    FILTER(...date test...)
> >>> }
> >>>
> >>> 2/
> >>> SELECT (count(*) AS ?C) {
> >>>    ?paper  iospress:publicationDate ?pubDate ;
> >>>            iospress:publicationIncludesKeyword ?keyword .
> >>>    FILTER (...date test... && regex(?keyword, "sickness", "i"))
> >>> }
> >>>
> >>> 3/
> >>> SELECT (count(*) AS ?C) {
> >>>     {?paper rdf:type iospress:Chapter.}
> >>>               union
> >>>     {?paper rdf:type iospress:Article.}
> >>>     ?paper  iospress:publicationDate ?pubDate
> >>>     FILTER(...date test...)
> >>> }
> >>>
> >>> 4/
> >>> SELECT (count(*) AS ?C) {
> >>>    ?paper  iospress:publicationDate ?pubDate
> >>>    FILTER(.. date test...)
> >>>     {?paper rdf:type iospress:Chapter.}
> >>>               union
> >>>     {?paper rdf:type iospress:Article.}
> >>> }
> >>>
> >>> B::
> >>>
> >>> then is it the case that some optionals have more effect than others?
> >>> Some are "high risk":
> >>>
> >>> ---
> >>>       OPTIONAL {
> >>>           ?author iospress:contributorAffiliation ?affiliation.
> >>>           ?affiliation rdfs:label ?university;
> >>>       }
> >>>        OPTIONAL {
> >>>         ?affiliation iospress:geocodingOutput ?geocoded.
> >>>         ?geocoded iospress-geocode:country ?country
> >>>       }
> >>> ---
> >>> Suppose the first does not match; then the second yields a lot of results
> >>> unrelated to ?paper.
> >>>
> >>> C::
> >>>
> >>> distinct
> >>>
> >>> it might be worth trying without distinct because distinct can cause a
> >>> lot of results to be reduced to just a few, hiding redundant work.
> >>>
> >>>       Andy
> >>>
> >>> On 18/05/2021 13:31, Martin Van Aken wrote:
> >>>> Hello again,
> >>>> After some more days of me trying to get a better performance & the
> >>>> approval of my company, here is what we try to run (query at the
> >>>> bottom
> >>> of
> >>>> the mail).
> >>>>
> >>>> For some context:
> >>>>
> >>>> - This is a search for academic papers. Papers have multiple authors, and
> >>>> authors are part of multiple universities. Papers also have multiple
> >>>> keywords and are generally part of a set (an issue), itself part of a set
> >>>> (a volume), itself part of a set (a journal).
> >>>> - Our goal is to have a multicriteria search front end, so the query is
> >>>> generated from a form with clauses selected by the user. The structure is
> >>>> always the same; this example uses a single condition on the "keyword".
> >>>> - The set of data is relatively small - around 150k papers (so probably 1M
> >>>> triples there), probably around 500k authors.
> >>>> - We use GROUP BY with group_concat as we want to give as results one line
> >>>> per paper (vs having one per paper per keyword, for example).
> >>>> - I've read OPTIONALs are pretty bad - but I've no real alternative here
> >>>> that I know of when some fields can be present or not and I don't want
> >>>> to throw away all papers that miss at least one.
> >>>>
> >>>> For our current results, all but the most precise queries (getting into a
> >>>> super limited set of papers, like <10) get extremely slow (easily dozens
> >>>> of seconds, sometimes more). I feel that there is something obvious that
> >>>> I'm missing, either in the query or my Jena config. The server is on an
> >>>> old version but I run my tests locally on a 4.0.0 "out of the box" (zero
> >>>> configuration).
> >>>>
> >>>> What I've tried:
> >>>>
> >>>> - Removing the ORDER BY does not change much
> >>>> - Removing most optionals works... but defeats the point of the query
> >>>> - Using contains instead of regex does not change much (my goal is to
> >>>> use the Jena/Lucene integration for everything text related)
> >>>>
> >>>> I'd really like an opinion: from my RDBMS background, this is the
> >>>> equivalent of less than 3M records split across around 8 tables -
> >>>> something that should be queryable mostly in sub-second times.
> >>>>
> >>>> Any feedback is most welcome !
> >>>>
> >>>> Martin
> >>>>
> >>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> >>>>       PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> >>>>       PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> >>>>       PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
> >>>>       PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> >>>>       PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> >>>>
> >>>>       SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
> >>>>           (group_concat(distinct ?authorName;separator=", ") as ?Authors)
> >>>>           (group_concat(distinct ?keyword;separator=", ") as ?keywords)
> >>>>           (group_concat(distinct ?university;separator=", ") as ?universities)
> >>>>           (group_concat(distinct ?country;separator=", ") as ?countries)
> >>>>       WHERE {
> >>>>           {?paper rdf:type iospress:Chapter.}
> >>>>               union
> >>>>           {?paper rdf:type iospress:Article.}
> >>>>
> >>>>           ?paper rdfs:label ?title;
> >>>>                    rdf:type ?type;
> >>>>
> >>>>                    iospress:publicationDate ?pubDate;
> >>>>                    iospress:publicationAbstract ?abstract;
> >>>>
> >>>>                    iospress:publicationIncludesKeyword ?keyword;
> >>>>                    iospress:publicationAuthorList [?idx ?author].
> >>>>
> >>>>           ?issueOrBook iospress:partOf ?volumeOrSerie.
> >>>>           ?paper iospress:partOf ?issueOrBook.
> >>>>
> >>>>
> >>>>       OPTIONAL {
> >>>>           ?issueOrBook iospress:isbn ?bookIsbn.
> >>>>       }
> >>>>       OPTIONAL {
> >>>>           ?paper iospress:publicationDoiUrl ?doi.
> >>>>       }
> >>>>       OPTIONAL {
> >>>>           ?author rdfs:label ?authorName.
> >>>>       }
> >>>>       OPTIONAL {
> >>>>           ?author iospress:contributorAffiliation ?affiliation.
> >>>>           ?affiliation rdfs:label ?university;
> >>>>       }
> >>>>        OPTIONAL {
> >>>>         ?affiliation iospress:geocodingOutput ?geocoded.
> >>>>         ?geocoded iospress-geocode:country ?country
> >>>>       }
> >>>>       OPTIONAL {
> >>>>           ?paper iospress:publicationAccessibility ?access.
> >>>>       }
> >>>>       OPTIONAL {
> >>>>           ?volumeOrSerie iospress:partOf ?journal;
> >>>>       }
> >>>>       FILTER(
> >>>>           (
> >>>>               (datatype(?pubDate) = xsd:date &&
> >>>>                xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime &&
> >>>>                xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
> >>>>               (datatype(?pubDate) = xsd:gYear &&
> >>>>                ?pubDate >= "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
> >>>>           )
> >>>>
> >>>>           && (regex (?keyword, "sickness", "i"))
> >>>>           )
> >>>>       }
> >>>>       GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
> >>>>
> >>>>       ORDER BY ?pubDate ?paper
> >>>>       LIMIT 50
> >>>>
> >>>>
> >>>> On Thu, 6 May 2021 at 20:10, Andy Seaborne <an...@apache.org> wrote:
> >>>>
> >>>>> Hi there,
> >>>>>
> >>>>> Showing the query would be helpful but some general remarks:
> >>>>>
> >>>>> 1/ If the query or the Fuseki setup needs more than the default heap
> >>>>> size, the JVM may be getting into a state of heap exhaustion. This
> >>>>> manifests as the CPU load getting very high. It will seem like nothing
> >>>>> is happening (waiting for a response).
> >>>>>
> >>>>> 2/ The query may be expensive.
> >>>>>
> >>>>> Things to look for:
> >>>>> * cross products - two parts of the query pattern that are not connected.
> >>>>>
> >>>>> { ?s ?p ?o . ?a ?b ?c } yields N-squared results, where N is the size of
> >>>>> the database.
> >>>>>
> >>>>> * sort - spilling to disk, or combined with a cross product in the query.
> >>>>>
> >>>>> 3/ If no results are coming back, then the query is of a form that does
> >>>>> not stream results back - sort, or CONSTRUCT maybe.
> >>>>>
> >>>>> There was a useful presentation recently that talks about the principles
> >>>>> of query efficiency.
> >>>>>
> >>>>> SPARQL Query Optimization with Pavel Klinov
> >>>>> https://www.youtube.com/watch?v=16eMswT2x2Y
> >>>>>
> >>>>> More inline:
> >>>>>
> >>>>> On 06/05/2021 09:54, Martin Van Aken wrote:
> >>>>>> Hi!
> >>>>>> I'm Martin, I'm a software developer new to the Triples/SPARQL
> >>>>>> world.
> >>> I'm
> >>>>>> currently building queries against a Fuseki/TDB backend (that I can
> >>> work
> >>>>> on
> >>>>>> too) and I'm getting into significant performance problems
> >>>>>> (including
> >>>>> never
> >>>>>> ending queries).
> >>>>>
> >>>>> Are updates also happening at the same time?
> >>>>>
> >>>>>> Despite what I thought was a good search on the apache
> >>>>>> jena website I could not find a lot of insight about performance
> >>>>>> investigation so I'm trying it here.
> >>>>>>
> >>>>>> Most of my data experience comes from the relational world (ex:
> >>>>>> PG) so
> >>>>> I'm
> >>>>>> sometimes drawing comparisons there.
> >>>>>>
> >>>>>> To give some context my data set is around 15 linked concepts,
> >>>>>> with the
> >>>>>> number of triples for each ranging from some hundreds to 500K -
> >>>>>> total
> >>>>> less
> >>>>>> than 2 millions (documents/authors/publication kind of data).
> >>>>>>
> >>>>>> Unto questions:
> >>>>>>
> >>>>>>       - When I'm facing a slow query, what are my investigation
> >>> options. Is
> >>>>>>       there an equivalent of an "explain plan" in SQL pointing to
> >>>>>> the
> >>> query
> >>>>>>       specific slow points? What's the advised way for performance
> >>> checks
> >>>>> in
> >>>>>>       SPARQL?
> >>>>>
> >>>>> qparse --print=opt --file query.rq
> >>>>>
> >>>>>>       - Are there any performance setups to be aware of on the
> >>>>>> server
> >>> side?
> >>>>>>       Like ways to check indexes are correctly built (outside of
> >>>>>> text
> >>>>> search that
> >>>>>>       I'm not working with for the moment)
> >>>>>>       - We're currently using TDB1. I've seen the transactional
> >>> benefits of
> >>>>>>       TDB2 - are there performance improvements too that would
> >>>>>> warrant a
> >>>>>>       migration there ?
> >>>>>
> >>>>> Not on the query side.
> >>>>>
> >>>>>        Andy
> >>>>>
> >>>>>>
> >>>>>> Thanks a lot already!
> >>>>>>
> >>>>>> Martin
> >>>>>>
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >>
>
>


Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Steve Vestal <st...@adventiumlabs.com>.
Andy pointed at sequential OPTIONALs.  One example I have seen had 
nested OPTIONAL clauses to address a performance issue.  Might that be 
helpful here?
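
A minimal sketch of what that nesting could look like for the two OPTIONAL
blocks in Martin's query. The geocoding lookup moves inside the affiliation
OPTIONAL, so it is only attempted when ?affiliation is bound. The first
triple pattern is only a stand-in for whatever binds ?author in the real
query, and this has not been run against the actual data.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>

SELECT ?author ?university ?country
WHERE {
    # Stand-in for the part of the full query that binds ?author.
    ?author rdfs:label ?authorName .

    OPTIONAL {
        ?author iospress:contributorAffiliation ?affiliation .
        ?affiliation rdfs:label ?university .

        # Nested: evaluated only for an ?affiliation found just above,
        # so it can never run with ?affiliation unbound.
        OPTIONAL {
            ?affiliation iospress:geocodingOutput ?geocoded .
            ?geocoded iospress-geocode:country ?country
        }
    }
}
```

Authors with no affiliation still appear in the results, with ?university
and ?country left unbound.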



Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Andy Seaborne <an...@apache.org>.

On 20/05/2021 09:36, Martin Van Aken wrote:
> Andy,
> A big thanks for this - it gives me some paths to explore. I think indeed
> my biggest problems are in the optional parts - I'll run the test you
> advised and also look in which case I may be able to get rid of the
> optionals to avoid those situations that could lead to a big amount of
> results as you mentioned. I'm already looking at getting my filters closer
> to definition - can this be done for things other than pure equality (for
> example for the date that are testing for a range?).
> 
> Maybe one question about optional - I use them in some cases to avoid empty
> results. An example is Access - some papers have an Access triple (Open or
> Closed) - but some have none. My understanding is that if I make a link
> without optional like:
> 
> ?paper iospress:accessibility ?access

If it is just one triple in the OPTIONAL, it is less likely to be bad, but if 
the query later uses that variable while it is unbound, there will be a very 
large number of results, many of them duplicates and not actually related to 
the ?paper. I am guessing, but I would not be surprised if your query has 
variants of this that are hidden by the "distinct".

This is the problem at:

 >> ---
 >>       OPTIONAL {
 >>           ?author iospress:contributorAffiliation ?affiliation.
 >>           ?affiliation rdfs:label ?university;
 >>       }
 >>        OPTIONAL {
 >>         ?affiliation iospress:geocodingOutput ?geocoded.
 >>         ?geocoded iospress-geocode:country ?country
 >>       }
 >> ---

If there is no ?affiliation, then the second OPTIONAL matches over the whole 
database, which I'm guessing is many results.
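
One way to see how often this case arises is to count the authors for which
the first OPTIONAL cannot match. This is only a sketch: it reuses the author
list pattern from the full query, and assumes every value hanging off the
list node is an author.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT (count(DISTINCT ?author) AS ?authorsWithoutAffiliation)
WHERE {
    # Same author list pattern as in the full query.
    ?paper iospress:publicationAuthorList [ ?idx ?author ] .

    # Keep only authors for which the first OPTIONAL would find nothing.
    FILTER NOT EXISTS {
        ?author iospress:contributorAffiliation ?affiliation .
        ?affiliation rdfs:label ?university
    }
}
```

A non-zero count means there are rows where ?affiliation reaches the second
OPTIONAL unbound.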

     Andy

> this will de facto remove all papers without access from the set. This is
> something I don't want (I want them in the list, just with an empty value
> there) - and my understanding is that the way to manage this is an
> Optional. Is this correct? Is there a "better" way? If this ends up being
> costly, I could also check to actually have a value for those (those
> without value are technically "Closed").
> 
> Something I was wondering also is whether it makes sense to split the
> fields I need for search/filtering vs the ones I want to see on the result.
> I've a feeling that in theory I could play with two queries - one with only
> the params I need for the filtering, then play something similar to
> DESCRIBE on each record on the filtered set - but I've no idea if this
> would be more performant than keeping it together as it is now.
> 
> Anyway, the exchanges here are much appreciated!
> 
> On Tue, 18 May 2021 at 19:18, Andy Seaborne <an...@apache.org> wrote:
> 
>> Martin,
>>
>> That's a complicated query and I haven't got my head around it completely
>> yet.
>>
>> There are some useful points to understand:
>>
>> A::
>>
>> What is the time and outcome of these queries that focus on the main
>> data location part:
>>
>> 1/
>>
>> SELECT (count(*) AS ?C) {
>>    ?paper  iospress:publicationDate ?pubDate
>>    FILTER(...date test...)
>> }
>>
>> 2/
>> SELECT (count(*) AS ?C) {
>>    ?paper  iospress:publicationDate ?pubDate ;
>>            iospress:publicationIncludesKeyword ?keyword .
>>    FILTER(...date... && regex(?keyword, "sickness", "i"))
>> }
>>
>> 3/
>> SELECT (count(*) AS ?C) {
>>     {?paper rdf:type iospress:Chapter.}
>>               union
>>     {?paper rdf:type iospress:Article.}
>>     ?paper  iospress:publicationDate ?pubDate
>>     FILTER(...date test...)
>> }
>>
>> 4/
>> SELECT (count(*) AS ?C) {
>>    ?paper  iospress:publicationDate ?pubDate
>>    FILTER(.. date test...)
>>     {?paper rdf:type iospress:Chapter.}
>>               union
>>     {?paper rdf:type iospress:Article.}
>> }
>>
>> B::
>>
>> then is it the case that some optionals have more effect than others?
>> Some are "high risk":
>>
>> ---
>>       OPTIONAL {
>>           ?author iospress:contributorAffiliation ?affiliation.
>>           ?affiliation rdfs:label ?university;
>>       }
>>        OPTIONAL {
>>         ?affiliation iospress:geocodingOutput ?geocoded.
>>         ?geocoded iospress-geocode:country ?country
>>       }
>> ---
>> Suppose the first does not match then the second is a lot of results
>> unrelated to ?paper.
>>
>> C::
>>
>> distinct
>>
>> it might be worth trying without distinct because distinct can cause a
>> lot of results to be reduced to just a few, hiding redundant work.
>>
>>       Andy
>>
>> On 18/05/2021 13:31, Martin Van Aken wrote:
>>> Hello again,
>>> After some more days of me trying to get a better performance & the
>>> approval of my company, here is what we try to run (query at the bottom
>> of
>>> the mail).
>>>
>>> For some context:
>>>
>>> - This is a search for academia papers. Papers have multiple authors, and
>>> authors are part of multiple universities. Papers also have multiple
>>> keywords and are generally part of a set (an issue) itself part of a set
>> (a
>>> volume) itself part of a set (a journal).
>>> - Our goal is to have a multicriteria search front end, so the query is
>>> generated from a form with clauses selected by the user. The structure is
>>> always the same, this example use a single condition on the "keyword"
>>> - The set of data is relatively small - around 150k papers (so probably
>> 1M
>>> triples there), probably around 500k authors
>>> - We use group/concat as we want to give as results one line per paper
>> (vs
>>> having one per paper per keyword for example)
>>> - I've read OPTIONALs are pretty bad - but I've no real alternative here
>>> that I know of when some fields can be present or not and I don't want to
>>> throw away all papers that miss at least one
>>>
>>> For our current results, all but the most precise queries (getting into a
>>> super limited set of papers, like <10) get extremely slow (easily to
>> dozens
>>> of seconds, sometimes more). I feel that there is something obvious that
>>> I'm missing, either in the query or my Jena config. The server is on an
>> old
>>> version but I make my tests locally on a 4.0.0 "out of the box" (0
>>> configuration).
>>>
>>> What I've tried:
>>>
>>> - Removing the ORDER does not impact much
>>> - Removing most optionals works... but remove the point of the query
>>> - Using contains instead of regex does not impact much (I've the goal to
>>> use Jena/Lucene integration for everything text related)
>>>
>>> I'm really in for an opinion as taking my RDBMS background this is the
>>> equivalent of less than 3M records split on around 8 tables - something
>>> that should be queryable mostly in sub second times.
>>>
>>> Any feedback is most welcome !
>>>
>>> Martin
>>>
>>> PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
>>>       PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
>>>       PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
>>>       PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
>>>       PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
>>>       PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
>>>
>>>       SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
>>>           (group_concat(distinct ?authorName;separator=", ") as ?Authors)
>>>           (group_concat(distinct ?keyword;separator=", ") as ?keywords)
>>>           (group_concat(distinct ?university;separator=", ") as
>> ?universities)
>>>           (group_concat(distinct ?country;separator=", ") as ?countries)
>>>       WHERE {
>>>           {?paper rdf:type iospress:Chapter.}
>>>               union
>>>           {?paper rdf:type iospress:Article.}
>>>
>>>           ?paper rdfs:label ?title;
>>>                    rdf:type ?type;
>>>
>>>                    iospress:publicationDate ?pubDate;
>>>                    iospress:publicationAbstract ?abstract;
>>>
>>>                    iospress:publicationIncludesKeyword ?keyword;
>>>                    iospress:publicationAuthorList [?idx ?author].
>>>
>>>           ?issueOrBook iospress:partOf ?volumeOrSerie.
>>>           ?paper iospress:partOf ?issueOrBook.
>>>
>>>
>>>       OPTIONAL {
>>>           ?issueOrBook iospress:isbn ?bookIsbn.
>>>       }
>>>       OPTIONAL {
>>>           ?paper iospress:publicationDoiUrl ?doi.
>>>       }
>>>       OPTIONAL {
>>>           ?author rdfs:label ?authorName.
>>>       }
>>>       OPTIONAL {
>>>           ?author iospress:contributorAffiliation ?affiliation.
>>>           ?affiliation rdfs:label ?university;
>>>       }
>>>        OPTIONAL {
>>>         ?affiliation iospress:geocodingOutput ?geocoded.
>>>         ?geocoded iospress-geocode:country ?country
>>>       }
>>>       OPTIONAL {
>>>           ?paper iospress:publicationAccessibility ?access.
>>>       }
>>>       OPTIONAL {
>>>           ?volumeOrSerie iospress:partOf ?journal;
>>>       }
>>>       FILTER(
>>>           (
>>>               (datatype(?pubDate) = xsd:date && xsd:dateTime(?pubDate) >
>>> "1999-12-31T23:00:00.000Z"^^xsd:dateTime && xsd:dateTime(?pubDate) <
>>> "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
>>>               (datatype(?pubDate) = xsd:gYear && ?pubDate >=
>>> "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
>>>           )
>>>
>>>           && (regex (?keyword, "sickness", "i"))
>>>           )
>>>       }
>>>       GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
>>>
>>>       ORDER BY ?pubDate ?paper
>>>       LIMIT 50
>>>
>>>
>>> On Thu, 6 May 2021 at 20:10, Andy Seaborne <an...@apache.org> wrote:
>>>
>>>> Hi there,
>>>>
>>>> Showing the query would be helpful but some general remarks:
>>>>
>>>> 1/ If the query or the setup for Fuseki is needing more than the default
>>>> heap size, then it might be that the Java JVM is getting into a state of
>>>> heap exhaustion. This manifests as the CPU loading getting very high. It
>>>> will seem like nothing is happening (waiting for response).
>>>>
>>>> 2/ The query may be expensive.
>>>>
>>>> Things to look for
>>>> * cross products - two parts of the query pattern that are not
>>>> connected.
>>>>
>>>> { ?s ?p ?o . ?a ?b ?c } is N-squared the size of the database.
>>>>
>>>> * sort, spilling to disk or combined with a cross product in the query.
>>>>
>>>> 3/ If no results are coming back, then the query is of a form that does not
>>>> stream results back - sort, or CONSTRUCT maybe.
>>>>
>>>> There was a useful presentation recently that talks about the principles
>>>> of query efficiency.
>>>>
>>>> SPARQL Query Optimization with Pavel Klinov
>>>> https://www.youtube.com/watch?v=16eMswT2x2Y
>>>>
>>>> More inline:
>>>>
>>>> On 06/05/2021 09:54, Martin Van Aken wrote:
>>>>> Hi!
>>>>> I'm Martin, I'm a software developer new to the Triples/SPARQL world.
>> I'm
>>>>> currently building queries against a Fuseki/TDB backend (that I can
>> work
>>>> on
>>>>> too) and I'm getting into significant performance problems (including
>>>> never
>>>>> ending queries).
>>>>
>>>> Are updates also happening at the same time?
>>>>
>>>>> Despite what I thought was a good search on the apache
>>>>> jena website I could not find a lot of insight about performance
>>>>> investigation so I'm trying it here.
>>>>>
>>>>> Most of my data experience comes from the relational world (ex: PG) so
>>>> I'm
>>>>> sometimes drawing comparisons there.
>>>>>
>>>>> To give some context my data set is around 15 linked concepts, with the
>>>>> number of triples for each ranging from some hundreds to 500K - total
>>>> less
>>>>> than 2 millions (documents/authors/publication kind of data).
>>>>>
>>>>> On to questions:
>>>>>
>>>>>       - When I'm facing a slow query, what are my investigation
>> options. Is
>>>>>       there an equivalent of an "explain plan" in SQL pointing to the
>> query
>>>>>       specific slow points? What's the advised way for performance
>> checks
>>>> in
>>>>>       SPARQL?
>>>>
>>>> qparse --print=opt --file query.rq
>>>>
>>>>>       - Are there any performance setups to be aware of on the server
>> side?
>>>>>       Like ways to check indexes are correctly built (outside of text
>>>> search that
>>>>>       I'm not working with for the moment)
>>>>>       - We're currently using TDB1. I've seen the transactional
>> benefits of
>>>>>       TDB2 - are there performance improvements too that would warrant a
>>>>>       migration there ?
>>>>
>>>> Not on the query side.
>>>>
>>>>        Andy
>>>>
>>>>>
>>>>> Thanks a lot already!
>>>>>
>>>>> Martin
>>>>>
>>>>
>>>
>>>
>>
> 
> 

Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Martin Van Aken <ma...@joyouscoding.com>.
Andy,
A big thanks for this - it gives me some paths to explore. I think indeed
my biggest problems are in the optional parts - I'll run the test you
advised and also look in which case I may be able to get rid of the
optionals to avoid those situations that could lead to a big amount of
results as you mentioned. I'm already looking at getting my filters closer
to definition - can this be done for things other than pure equality (for
example for the date that are testing for a range?).

Maybe one question about optional - I use them in some cases to avoid empty
results. An example is Access - some papers have an Access triple (Open or
Closed) - but some have none. My understanding is that if I make a link
without optional like:

?paper iospress:accessibility ?access

this will de facto remove all papers without access from the set. This is
something I don't want (I want them in the list, just with an empty value
there) - and my understanding is that the way to manage this is an
Optional. Is this correct? Is there a "better" way? If this ends up being
costly, I could also check to actually have a value for those (those
without value are technically "Closed").
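
For illustration, a sketch of keeping the OPTIONAL but substituting a default
when no access triple exists, using COALESCE from SPARQL 1.1. The property
name is the one from the full query (iospress:publicationAccessibility). The
plain literal "Closed" is a placeholder assumption; the real data may use an
IRI or a typed value instead.

```sparql
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT ?paper (COALESCE(?access, "Closed") AS ?accessOrDefault)
WHERE {
    ?paper rdf:type iospress:Article .

    # Papers without an access triple stay in the results;
    # ?access is simply left unbound for them.
    OPTIONAL { ?paper iospress:publicationAccessibility ?access }
}
```

COALESCE returns its first argument that evaluates without error, so papers
with no access triple get "Closed" and all others keep their stored value.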

Something I was wondering also is whether it makes sense to split the
fields I need for search/filtering vs the ones I want to see on the result.
I've a feeling that in theory I could play with two queries - one with only
the params I need for the filtering, then play something similar to
DESCRIBE on each record on the filtered set - but I've no idea if this
would be more performant than keeping it together as it is now.
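
To make the idea concrete, a sketch of how the split could be written as a
single request using a subquery: the inner SELECT uses only the search
fields and applies the LIMIT, and the outer pattern then fetches display
fields for those papers only. The date test and the Chapter/Article union
are left out for brevity, and whether this is actually faster on this data
is untested.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

SELECT ?paper ?pubDate ?title ?doi ?access
WHERE {
    # Phase 1: find the matching papers using only the search fields.
    {
        SELECT DISTINCT ?paper ?pubDate
        WHERE {
            ?paper iospress:publicationDate ?pubDate ;
                   iospress:publicationIncludesKeyword ?keyword .
            FILTER(regex(?keyword, "sickness", "i"))
        }
        ORDER BY ?pubDate ?paper
        LIMIT 50
    }

    # Phase 2: fetch the display fields for those papers only.
    ?paper rdfs:label ?title .
    OPTIONAL { ?paper iospress:publicationDoiUrl ?doi }
    OPTIONAL { ?paper iospress:publicationAccessibility ?access }
}
ORDER BY ?pubDate ?paper
```

The OPTIONALs then run against at most 50 papers instead of every candidate
row, which is the point of splitting the two concerns.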

Anyway, the exchanges here are much appreciated!

On Tue, 18 May 2021 at 19:18, Andy Seaborne <an...@apache.org> wrote:

> Martin,
>
> That's a complicated query and I haven't got my head around it completely
> yet.
>
> There are some useful points to understand:
>
> A::
>
> What is the time and outcome of these queries that focus on the main
> data location part:
>
> 1/
>
> SELECT (count(*) AS ?C) {
>   ?paper  iospress:publicationDate ?pubDate
>   FILTER(...date test...)
> }
>
> 2/
> SELECT (count(*) AS ?C) {
>   ?paper  iospress:publicationDate ?pubDate ;
>           iospress:publicationIncludesKeyword ?keyword .
>   FILTER(...date... && regex(?keyword, "sickness", "i"))
> }
>
> 3/
> SELECT (count(*) AS ?C) {
>    {?paper rdf:type iospress:Chapter.}
>              union
>    {?paper rdf:type iospress:Article.}
>    ?paper  iospress:publicationDate ?pubDate
>    FILTER(...date test...)
> }
>
> 4/
> SELECT (count(*) AS ?C) {
>   ?paper  iospress:publicationDate ?pubDate
>   FILTER(.. date test...)
>    {?paper rdf:type iospress:Chapter.}
>              union
>    {?paper rdf:type iospress:Article.}
> }
>
> B::
>
> then is it the case that some optionals have more effect than others?
> Some are "high risk":
>
> ---
>      OPTIONAL {
>          ?author iospress:contributorAffiliation ?affiliation.
>          ?affiliation rdfs:label ?university;
>      }
>       OPTIONAL {
>        ?affiliation iospress:geocodingOutput ?geocoded.
>        ?geocoded iospress-geocode:country ?country
>      }
> ---
> Suppose the first does not match then the second is a lot of results
> unrelated to ?paper.
>
> C::
>
> distinct
>
> it might be worth trying without distinct because distinct can cause a
> lot of results to be reduced to just a few, hiding redundant work.
>
>      Andy
>
> On 18/05/2021 13:31, Martin Van Aken wrote:
> > Hello again,
> > After some more days of me trying to get a better performance & the
> > approval of my company, here is what we try to run (query at the bottom
> of
> > the mail).
> >
> > For some context:
> >
> > - This is a search for academia papers. Papers have multiple authors, and
> > authors are part of multiple universities. Papers also have multiple
> > keywords and are generally part of a set (an issue) itself part of a set
> (a
> > volume) itself part of a set (a journal).
> > - Our goal is to have a multicriteria search front end, so the query is
> > generated from a form with clauses selected by the user. The structure is
> > always the same, this example use a single condition on the "keyword"
> > - The set of data is relatively small - around 150k papers (so probably
> 1M
> > triples there), probably around 500k authors
> > - We use group/concat as we want to give as results one line per paper
> (vs
> > having one per paper per keyword for example)
> > - I've read OPTIONALs are pretty bad - but I've no real alternative here
> > that I know of when some fields can be present or not and I don't want to
> > throw away all papers that miss at least one
> >
> > For our current results, all but the most precise queries (getting into a
> > super limited set of papers, like <10) get extremely slow (easily to
> dozens
> > of seconds, sometimes more). I feel that there is something obvious that
> > I'm missing, either in the query or my Jena config. The server is on an
> old
> > version but I make my tests locally on a 4.0.0 "out of the box" (0
> > configuration).
> >
> > What I've tried:
> >
> > - Removing the ORDER does not impact much
> > - Removing most optionals works... but remove the point of the query
> > - Using contains instead of regex does not impact much (I've the goal to
> > use Jena/Lucene integration for everything text related)
> >
> > I'm really in for an opinion as taking my RDBMS background this is the
> > equivalent of less than 3M records split on around 8 tables - something
> > that should be queryable mostly in sub second times.
> >
> > Any feedback is most welcome !
> >
> > Martin
> >
> > PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
> >      PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
> >      PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
> >      PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
> >      PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
> >      PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
> >
> >      SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
> >          (group_concat(distinct ?authorName;separator=", ") as ?Authors)
> >          (group_concat(distinct ?keyword;separator=", ") as ?keywords)
> >          (group_concat(distinct ?university;separator=", ") as
> ?universities)
> >          (group_concat(distinct ?country;separator=", ") as ?countries)
> >      WHERE {
> >          {?paper rdf:type iospress:Chapter.}
> >              union
> >          {?paper rdf:type iospress:Article.}
> >
> >          ?paper rdfs:label ?title;
> >                   rdf:type ?type;
> >
> >                   iospress:publicationDate ?pubDate;
> >                   iospress:publicationAbstract ?abstract;
> >
> >                   iospress:publicationIncludesKeyword ?keyword;
> >                   iospress:publicationAuthorList [?idx ?author].
> >
> >          ?issueOrBook iospress:partOf ?volumeOrSerie.
> >          ?paper iospress:partOf ?issueOrBook.
> >
> >
> >      OPTIONAL {
> >          ?issueOrBook iospress:isbn ?bookIsbn.
> >      }
> >      OPTIONAL {
> >          ?paper iospress:publicationDoiUrl ?doi.
> >      }
> >      OPTIONAL {
> >          ?author rdfs:label ?authorName.
> >      }
> >      OPTIONAL {
> >          ?author iospress:contributorAffiliation ?affiliation.
> >          ?affiliation rdfs:label ?university;
> >      }
> >       OPTIONAL {
> >        ?affiliation iospress:geocodingOutput ?geocoded.
> >        ?geocoded iospress-geocode:country ?country
> >      }
> >      OPTIONAL {
> >          ?paper iospress:publicationAccessibility ?access.
> >      }
> >      OPTIONAL {
> >          ?volumeOrSerie iospress:partOf ?journal;
> >      }
> >      FILTER(
> >          (
> >              (datatype(?pubDate) = xsd:date && xsd:dateTime(?pubDate) >
> > "1999-12-31T23:00:00.000Z"^^xsd:dateTime && xsd:dateTime(?pubDate) <
> > "2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
> >              (datatype(?pubDate) = xsd:gYear && ?pubDate >=
> > "2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
> >          )
> >
> >          && (regex (?keyword, "sickness", "i"))
> >          )
> >      }
> >      GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access
> >
> >      ORDER BY ?pubDate ?paper
> >      LIMIT 50
> >
> >
> > On Thu, 6 May 2021 at 20:10, Andy Seaborne <an...@apache.org> wrote:



Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Andy Seaborne <an...@apache.org>.
Martin,

That's a complicated query and I haven't got my head around it completely
yet.

There are some useful points to understand:

A::

What are the time and outcome of each of these queries? They focus on the
part of the query that locates the main data:

1/

SELECT (count(*) AS ?C) {
  ?paper  iospress:publicationDate ?pubDate
  FILTER(...date test...)
}

2/
SELECT (count(*) AS ?C) {
  ?paper  iospress:publicationDate ?pubDate ;
          iospress:publicationIncludesKeyword ?keyword .
  FILTER (...date test... && regex(?keyword, "sickness", "i"))
}

3/
SELECT (count(*) AS ?C) {
   {?paper rdf:type iospress:Chapter.}
             union
   {?paper rdf:type iospress:Article.}
   ?paper  iospress:publicationDate ?pubDate
   FILTER(...date test...)
}

4/
SELECT (count(*) AS ?C) {
  ?paper  iospress:publicationDate ?pubDate
  FILTER(.. date test...)
   {?paper rdf:type iospress:Chapter.}
             union
   {?paper rdf:type iospress:Article.}
}
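
For reference, here is query 2 with the date test written out in full. This
is only a sketch: the FILTER is copied from the query Martin posted and
reformatted, and it has not been run against the data.

```sparql
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

# Count the papers matched by the date range and the keyword test only,
# without any of the OPTIONAL parts of the full query.
SELECT (count(*) AS ?C) {
  ?paper  iospress:publicationDate ?pubDate ;
          iospress:publicationIncludesKeyword ?keyword .
  FILTER (
    (
      ( datatype(?pubDate) = xsd:date
        && xsd:dateTime(?pubDate) > "1999-12-31T23:00:00.000Z"^^xsd:dateTime
        && xsd:dateTime(?pubDate) < "2021-05-18T12:16:58.841Z"^^xsd:dateTime )
      ||
      ( datatype(?pubDate) = xsd:gYear
        && ?pubDate >= "2000"^^xsd:gYear
        && ?pubDate <= "2021"^^xsd:gYear )
    )
    && regex(?keyword, "sickness", "i")
  )
}
```

Dropping the regex line gives query 1, so timing both shows how much of the
cost comes from the keyword test alone.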

B::

Then, is it the case that some OPTIONALs have more effect than others?
Some are "high risk":

---
     OPTIONAL {
         ?author iospress:contributorAffiliation ?affiliation.
         ?affiliation rdfs:label ?university;
     }
      OPTIONAL {
       ?affiliation iospress:geocodingOutput ?geocoded.
       ?geocoded iospress-geocode:country ?country
     }
---
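Suppose the first OPTIONAL does not match: ?affiliation is then unbound, so
the second OPTIONAL matches every affiliation that has a geocoded country,
which is a lot of results unrelated to ?paper.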
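
One possible way to rule this out, as an untested sketch, is to nest the
geocoding OPTIONAL inside the affiliation one, so that it is only evaluated
for an ?affiliation that was actually bound. The prefixes are the ones
declared in the original query. Note the change in meaning: an affiliation
without an rdfs:label now contributes no country at all.

```sparql
OPTIONAL {
    ?author iospress:contributorAffiliation ?affiliation .
    ?affiliation rdfs:label ?university .
    # Only tried when the outer OPTIONAL has bound ?affiliation.
    OPTIONAL {
        ?affiliation iospress:geocodingOutput ?geocoded .
        ?geocoded iospress-geocode:country ?country
    }
}
```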

C::

DISTINCT

It might be worth trying without DISTINCT, because DISTINCT can reduce a
lot of results to just a few, hiding redundant work.
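
A quick way to see how much work is being hidden is to compare the number of
rows with the number of distinct papers. This is a sketch using two of the
multi-valued properties from the original query; the names ?rows and ?papers
are illustrative.

```sparql
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

# Each paper yields one row per (keyword, author) combination, so ?rows
# can be much larger than ?papers.
SELECT (count(*) AS ?rows) (count(DISTINCT ?paper) AS ?papers) {
  ?paper iospress:publicationIncludesKeyword ?keyword ;
         iospress:publicationAuthorList [ ?idx ?author ] .
}
```

If ?rows is many times ?papers, the grouping and DISTINCT steps are
collapsing a large intermediate result.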

     Andy

On 18/05/2021 13:31, Martin Van Aken wrote:

Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Martin Van Aken <ma...@joyouscoding.com>.
Hello again,
After some more days spent trying to get better performance and my
company's approval, here is what we are trying to run (query at the bottom
of the mail).

For some context:

- This is a search for academic papers. Papers have multiple authors, and
authors are part of multiple universities. Papers also have multiple
keywords and are generally part of a set (an issue), itself part of a set (a
volume), itself part of a set (a journal).
- Our goal is to have a multicriteria search front end, so the query is
generated from a form with clauses selected by the user. The structure is
always the same; this example uses a single condition on the "keyword".
- The set of data is relatively small: around 150k papers (so probably 1M
triples there), probably around 500k authors.
- We use group/concat as we want to give as results one line per paper (vs
having one per paper per keyword, for example).
- I've read OPTIONALs are pretty bad, but I've no real alternative here
that I know of when some fields can be present or not and I don't want to
throw away every paper that misses at least one.

For our current results, all but the most precise queries (getting into a
super limited set of papers, like <10) get extremely slow (easily tens
of seconds, sometimes more). I feel that there is something obvious that
I'm missing, either in the query or my Jena config. The server is on an old
version, but I run my tests locally on a 4.0.0 "out of the box" (zero
configuration).

What I've tried:

- Removing the ORDER does not impact much
- Removing most optionals works... but removes the point of the query
- Using contains instead of regex does not impact much (I've the goal to
use Jena/Lucene integration for everything text related)

I'd really welcome an opinion: from my RDBMS background, this is the
equivalent of fewer than 3M records split across around 8 tables, something
that should mostly be queryable in sub-second times.

Any feedback is most welcome !

Martin

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
    PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
    PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>
    PREFIX iospress-geocode: <http://ld.iospress.nl/rdf/geocode/>
    PREFIX iospress-dt: <http://ld.iospress.nl/rdf/datatype/>
    PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

    SELECT ?type ?pubDate ?paper ?doi ?title ?abstract ?access
        (group_concat(distinct ?authorName;separator=", ") as ?Authors)
        (group_concat(distinct ?keyword;separator=", ") as ?keywords)
        (group_concat(distinct ?university;separator=", ") as ?universities)
        (group_concat(distinct ?country;separator=", ") as ?countries)
    WHERE {
        {?paper rdf:type iospress:Chapter.}
            union
        {?paper rdf:type iospress:Article.}

        ?paper rdfs:label ?title;
                 rdf:type ?type;

                 iospress:publicationDate ?pubDate;
                 iospress:publicationAbstract ?abstract;

                 iospress:publicationIncludesKeyword ?keyword;
                 iospress:publicationAuthorList [?idx ?author].

        ?issueOrBook iospress:partOf ?volumeOrSerie.
        ?paper iospress:partOf ?issueOrBook.


    OPTIONAL {
        ?issueOrBook iospress:isbn ?bookIsbn.
    }
    OPTIONAL {
        ?paper iospress:publicationDoiUrl ?doi.
    }
    OPTIONAL {
        ?author rdfs:label ?authorName.
    }
    OPTIONAL {
        ?author iospress:contributorAffiliation ?affiliation.
        ?affiliation rdfs:label ?university;
    }
     OPTIONAL {
      ?affiliation iospress:geocodingOutput ?geocoded.
      ?geocoded iospress-geocode:country ?country
    }
    OPTIONAL {
        ?paper iospress:publicationAccessibility ?access.
    }
    OPTIONAL {
        ?volumeOrSerie iospress:partOf ?journal;
    }
    FILTER(
        (
            (datatype(?pubDate) = xsd:date && xsd:dateTime(?pubDate) >
"1999-12-31T23:00:00.000Z"^^xsd:dateTime && xsd:dateTime(?pubDate) <
"2021-05-18T12:16:58.841Z"^^xsd:dateTime ) ||
            (datatype(?pubDate) = xsd:gYear && ?pubDate >=
"2000"^^xsd:gYear && ?pubDate <= "2021"^^xsd:gYear)
        )

        && (regex (?keyword, "sickness", "i"))
        )
    }
    GROUP BY ?type ?abstract ?pubDate ?paper ?doi ?title ?access

    ORDER BY ?pubDate ?paper
    LIMIT 50


On Thu, 6 May 2021 at 20:10, Andy Seaborne <an...@apache.org> wrote:




Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Martin Van Aken <ma...@joyouscoding.com>.
Thanks a lot, Andy & Steve, for the advice & material provided. This is
going to be invaluable.

Martin

On Thu, 6 May 2021 at 20:55, Steve Vestal <st...@adventiumlabs.com>
wrote:



Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Steve Vestal <st...@adventiumlabs.com>.
I asked about this topic a while ago and received some very helpful
pointers from this forum (thanks again!).  Here is the list I collected 
during some explorations:

http://www.lotico.com/index.php/SPARQL_Query_Optimization_with_Pavel_Klinov

https://www.dropbox.com/s/knudzewbiuqkqvy/SPARQL%20Optimisation%20101%20Tutorial.pptx?dl=0

https://events.static.linuxfound.org/sites/events/files/slides/SPARQL%20Optimisation%20101%20Tutorial.pdf

https://openproceedings.org/2014/conf/edbt/Gubichev014.pdf

http://sites.fas.harvard.edu/~cs265/papers/neumann-2008.pdf



On 5/6/2021 1:10 PM, Andy Seaborne wrote:


Re: Jena / Fuseki / SPARQL performance (new to the tech)

Posted by Andy Seaborne <an...@apache.org>.
Hi there,

Showing the query would be helpful but some general remarks:

1/ If the query or the Fuseki setup needs more than the default heap
size, then the JVM may be getting into a state of heap exhaustion. This
shows up as very high CPU load, and it will seem like nothing is
happening (the client just waits for a response).

2/ The query may be expensive.

Things to look for:
* cross products - two parts of the query pattern that are not
connected.

{ ?s ?p ?o . ?a ?b ?c } produces N-squared results, where N is the size of
the database.

* sort, especially one that spills to disk or is combined with a cross
product in the query.

3/ If no results are coming back, then the query may be of a form that does
not stream results back: a sort, or maybe CONSTRUCT.
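
The cross-product case in point 2 can be sketched as follows. The property
names are borrowed from the query Martin posted in his follow-up, so treat
them as assumptions about the data; the query is illustrative and untested.

```sparql
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX iospress: <http://ld.iospress.nl/rdf/ontology/>

# Cross product: the two triple patterns share no variable, so every
# (?paper, ?pubDate) row is paired with every (?author, ?authorName) row.
SELECT (count(*) AS ?C) {
  ?paper  iospress:publicationDate ?pubDate .
  ?author rdfs:label ?authorName .
  # Adding a pattern that links the two sides, for example
  #   ?paper iospress:publicationAuthorList [ ?idx ?author ] .
  # makes the pattern connected and removes the cross product.
}
```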

There was a useful presentation recently that talks about the principles 
of query efficiency.

SPARQL Query Optimization with Pavel Klinov
https://www.youtube.com/watch?v=16eMswT2x2Y

More inline:

On 06/05/2021 09:54, Martin Van Aken wrote:
> Hi!
> I'm Martin, I'm a software developer new to the Triples/SPARQL world. I'm
> currently building queries against a Fuseki/TDB backend (that I can work on
> too) and I'm getting into significant performance problems (including never
> ending queries).

Are updates also happening at the same time?

> Despite what I thought was a good search on the apache
> jena website I could not find a lot of insight about performance
> investigation so I'm trying it here.
> 
> Most of my data experience comes from the relational world (ex: PG) so I'm
> sometimes drawing comparisons there.
> 
> To give some context my data set is around 15 linked concepts, with the
> number of triples for each ranging from some hundreds to 500K - total less
> than 2 millions (documents/authors/publication kind of data).
> 
> Unto questions:
> 
>     - When I'm facing a slow query, what are my investigation options. Is
>     there an equivalent of an "explain plan" in SQL pointing to the query
>     specific slow points? What's the advised way for performance checks in
>     SPARQL?

qparse --print=opt --file query.rq

>     - Are there any performance setups to be aware of on the server side?
>     Like ways to check indexes are correctly built (outside of text search that
>     I'm not working with for the moment)
>     - We're currently using TDB1. I've seen the transactional benefits of
>     TDB2 - are there performance improvements too that would warrant a
>     migration there ?

Not on the query side.

     Andy

> 
> Thanks a lot already!
> 
> Martin
>