You are viewing a plain text version of this content. The canonical link for it is here.
Posted to users@jena.apache.org by Cristóbal Miranda <cr...@gmail.com> on 2021/07/28 18:39:24 UTC

How to set a timeout for QueryExecution correctly?

Hello everyone,

I'm trying to run a sequence of queries with TDB, using a
locally loaded dataset. I don't want to wait more than a few
seconds for each query to finish. My attempt to do this looks like
the following:

>
> try {
>   RDFConnection rdfConnection = RDFConnectionFactory.connect(dataset);
>   QueryExecution queryExecution = rdfConnection.query(query);
>   queryExecution.setTimeout(timeoutMilliseconds);
>   ResultSet resultSet = queryExecution.execSelect();
>   while (resultSet.hasNext()) {
>     QuerySolution querySolution = resultSet.next();
>     ...
>   }
> } catch (QueryCancelledException e) {
>    ...
> }


The problem is that this is not working. With htop I see that
the process gets stuck in disk operations. One of the queries
took about 2 hours with the code above. An idea would
be trying to run this in a new thread and stopping the thread outside once
the timeout is reached, but I'm almost sure this wouldn't be a safe
way to stop the processing, even if it worked.

Is there a better way to do this?

Cristobal

Re: How to set a timeout for QueryExecution correctly?

Posted by Andy Seaborne <an...@apache.org>.
Thank you!


On 03/08/2021 22:07, Cristóbal Miranda wrote:
> Hi Andy,
> 
> Q1 seems to have the same problem, I created a script to generate data for
> it:
> https://github.com/CristobalM/jena-query-timeout/blob/main/scripts/generate_q1_optional.py
> 
> It gets stuck in getPlan() when there are no matches for the OPTIONAL
> pattern
> having ?var5. I tried first with data having that and the timeout worked,
> for example,
> at 2 seconds.
> 
> I'll send you the Visualvm snapshot,


> 
> 
> On Mon, 2 Aug 2021 at 06:47, Andy Seaborne <an...@apache.org> wrote:
> 
>> Hi there,
>>
>> Good catch - I can make that happen on my system.
>>
>> Recorded as JENA-2139
>>
>> The fix looks quite simple - delay making the index table until first
>> use - which works from testing so far. PR soon.
>>
>> That explains Q2 and Q3 but not Q1 (no MINUS!).
>>
>> The same work-during-plan building may be happening - not in the group
>> (that is a delayed iterator).
>>
>> If Q1 returns zero results, then the problem isn't in the sorting code
>> code itself (that can spend a long time during which it is not reading
>> query iterators but sorting zero rows is quite quick) but could be in
>> the need to evaluate the pattern. e.g.  FILTER ( ! bound(?var5) ) is
>> always false.
>>
>> Removing parts of the query and seeing what happens may point to the
>> culprit.
>>
>>   > and got the result in the attachment.
>>
>> Sorry - no attachments on the list. If you still have it, could you send
>> it directly to me please?
>>
>>       Andy
>>
>> https://issues.apache.org/jira/browse/JENA-2139
>> <https://issues.apache.org/jira/browse/JENA-2139>
>>
>> On 02/08/2021 03:56, Cristóbal Miranda wrote:
>>> Hi Andy,
>>>
>>> Thanks for the clarifications.
>>>
>>> I have a program to run a query with timeout from a tdb dataset
>>> and a script to generate an nt file for the second query. Because it is
>>> the simplest among the three, I decided to start with that one. The code
>>> is here https://github.com/CristobalM/jena-query-timeout
>>> <https://github.com/CristobalM/jena-query-timeout>.
>>>
>>> I think I have spotted a problem using that java code. I put that into a
>>> new module in the jena project and modified QueryExecutionBase with
>>> some prints to see that everything is being called. My initial guess
>>> to do
>>> this was that there is a concurrency problem with the TimeoutCallback
>>> and cancel() was never being called on the iterator and from what I saw
>>> that seems to be the case.
>>>
>>> So what happens is that queryIterator = getPlan().iterator()
>>> at QueryExecutionBase:558 is taking some time to finish and until then,
>>> queryIterator will remain null. Then, if the timeout occurs before that,
>>> queryIterator.cancel() won't be called because to do that there is
>>> a check that queryIterator is not null (QueryExecutionBase:189).
>>>
>>> Using `python3 generate_minus_1.py dataset.nt 10000000 9000000`
>>> for the dataset (and loaded with tdb.tdbloader2) and setting a timeout
>>> of 30 seconds, the iterator is not cancelled, but with 60 seconds it is
>>> cancelled.
>>>
>>> I tried using 120 seconds of timeout for the full wikidata dataset and
>>> the Q1
>>> won't stop after that time, so I can't be sure if the same is happening
>>> there. I will continue with that query and see if it is the same and
>>> let you know what happens.
>>>
>>> Also, I did a sampling with VisualVM, using the Q2 and the dataset
>>> mentioned above and got the result in the attachment.
>>>
>>>
>>>
>>>
>>> On Fri, 30 Jul 2021 at 16:33, Andy Seaborne <andy@apache.org
>>> <ma...@apache.org>> wrote:
>>>
>>>
>>>
>>>      On 30/07/2021 18:04, Cristóbal Miranda wrote:
>>>      >> For Q3, does "LIMIT 1" make any difference?
>>>      >
>>>      >
>>>      > No, I'm trying it now and it is taking several minutes at least.
>>>      >
>>>      >
>>>      >> except different cardinality. Because ?var5 isn't used
>>>      eslewhere, is
>>>      >> that waht was meant here?
>>>      >
>>>      >
>>>      > I'm also not too sure about the meaning of the queries in
>>>      general, I'm just
>>>      > using
>>>      > a large dataset of anonymized queries. I took one of the
>>>      > files from here
>>>      https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
>>>      <https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en>
>>>      > (they say those are from Wikidata's users) and executed a script
>>>      to filter
>>>      > out some types of queries, the result of that is a file with 27
>>>      million
>>>      > queries.
>>>      > But to answer your question about this specific query, since this
>> is
>>>      > Wikidata,
>>>      > what I'd do is to run the query on https://query.wikidata.org/
>>>      <https://query.wikidata.org/>, and infer
>>>      > the meaning
>>>      > of the variables, ?var1 is some person and P570 date of death,
>>>      etc... and
>>>      > from the filters it seems
>>>      > that they want to find those who died in the last 30 days and
>>>      were born
>>>      > after 1954. ?var5
>>>      > usage seems to indicate that they want to find those whose cause
>>>      of death
>>>      > is not
>>>      > known (because it's an OPTIONAL and !BOUND would filter out
>>>      those where
>>>      > there is a cause of death).
>>>      > In this case it is reasonable that I get no results because the
>>>      dataset I'm
>>>      > using is older than 30 days.
>>>      > So yes, you are right, they are equivalent for this query and
>>>      with NOT
>>>      > EXISTS would
>>>      > probably be faster. But still, what I'd like to have is a way to
>>>      stop the
>>>      > processing
>>>      > after a few seconds consistently, independent of the query.
>>>      >
>>>      > I just did quick peek at the code and found that, correct me if
>>>      I'm wrong,
>>>      > the cancellation of
>>>      > a query is propagated in a top down manner for the
>>>      QueryIterators, in a way
>>>      > that to notice
>>>      > that the query was aborted you have to get the next result (I'm
>>>      looking
>>>      > specifically at QueryIterRepeatApply.java:65),
>>>      > meanwhile the query might be stuck at some heavy processing in a
>>>      lower
>>>      > level, which might be the case for
>>>      > nested OPTIONALs.
>>>
>>>      Yes - that the mechanism but nested OPTIONALs are optimized to
>>>      "(condition)" which incrementally reads iterators.
>>>
>>>      In a query plan, ("qparse --print=opt") it shows (condition) not
>>>      (leftjoin).
>>>
>>>      The (minus) operator may be reading all one side and then missing the
>>>      timeout but it is still advancing one side (the left)
>>>
>>>      So trying the FILTER NOT EXISTS version of one of the problem queries
>>>      will give us some help in identifying if the problem is MINUS or not.
>>>
>>>      > In the case of QueryIterMinus I see that in the constructor you
>>>      have to
>>>      > create the index for the right iterator, but
>>>      > that index creation is not stoppable. (On a side note, did you
>>>      try doing
>>>      > QueryIterMinus similar to QueryIterOptionalIndex,
>>>      > using QueryIterRepeatApply? I just have the impression that
>>>      running that
>>>      > subquery with left values bound and checking
>>>      > for only one result could be faster, but this is just a
>>>      suggestion in case
>>>      > you haven't tried, don't take this
>>>      > too seriously).
>>>
>>>      The right-hand-side index building is once per operator so gets
>>>      reuses
>>>      across the whole of traversing the left.
>>>
>>>      IndexFactory.create is reading a query iterator.
>>>
>>>      So on the surface, query iterators are being read and the time
>>>      should be
>>>      seen.
>>>
>>>      Unless it hits the LinearIndex case which looks expensive.
>>>
>>>      >
>>>      > If you consider that those are things to be fixed and don't have
>>>      it in high
>>>      > priority I could give it a try and send a PR,
>>>      > if it's ok with you.
>>>
>>>      That would be excellent but let's establish the problem first to
>>>      know it
>>>      is going to improve things.
>>>
>>>      If you can point VisualVM/YourKit/... at a slow query and identify
>>>      the
>>>      code hot spots, that would be perfect.
>>>
>>>      A small example dataset would also be good so we have a minimal,
>>>      executable example.
>>>
>>>      >
>>>      > Sorry if I seem too intrusive here, but I'm really interested in
>>>      having the
>>>      > timeout work well for any query!
>>>
>>>      Not at all - it is part of the point of open source.
>>>
>>>           Andy
>>>
>>>      >
>>>      >
>>>      >
>>>      > On Fri, 30 Jul 2021 at 04:47, Andy Seaborne <andy@apache.org
>>>      <ma...@apache.org>> wrote:
>>>      >
>>>      >> Hi Cristóbal,
>>>      >>
>>>      >> I can't tell from just the queries.
>>>      >>
>>>      >> You say the first query has zero results so it is not the ORDER
>> BY.
>>>      >>
>>>      >> The queries themselves may be inefficient and changing the
>>>      order may
>>>      >> have a significant effect (all data dependent).
>>>      >>
>>>      >> All three do not use FILTER NOT EXISTS where they could. That
>>>      may well
>>>      >> be faster.
>>>      >>
>>>      >> Q1 uses OPTIONAL !BOUND (see inline), Q2 and Q3 use MINUS.
>>>      >>
>>>      >> For Q3, does "LIMIT 1" make any difference?
>>>      >>
>>>      >>       Andy
>>>      >>
>>>      >> On 29/07/2021 15:57, Cristóbal Miranda wrote:
>>>      >>> Hi Andy,
>>>      >>>
>>>      >>> I have three queries in which this happens:
>>>      >>>
>>>      >>> 1.
>>>      >>>
>>>      >>> SELECT ?var1  ?var2  ?var3  ?var4
>>>      >>>> WHERE {
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P570
>>>      <http://www.wikidata.org/prop/direct/P570>> ?var2 .
>>>      >>>>    FILTER (  ( (  ?var2  > "2016-07-30T00:00:00Z"^^<
>>>      >>>> http://www.w3.org/2001/XMLSchema#dateTime
>>>      <http://www.w3.org/2001/XMLSchema#dateTime>> ) )
>>>      >>>> ) .
>>>      >>>>    FILTER (  ( (  ?var2  >  (  NOW (  ) - "P32D"^^<
>>>      >>>> http://www.w3.org/2001/XMLSchema#duration
>>>      <http://www.w3.org/2001/XMLSchema#duration>> )  ) && (  ?var2  <
>> NOW
>>>      >> (  )
>>>      >>>>    ) )
>>>      >>>> ) .
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P31
>>>      <http://www.wikidata.org/prop/direct/P31>> <
>>>      >>>> http://www.wikidata.org/entity/Q5
>>>      <http://www.wikidata.org/entity/Q5>> .
>>>      >>
>>>      >> ---------
>>>      >>>>    OPTIONAL {
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P1196
>>>      <http://www.wikidata.org/prop/direct/P1196>> ?var5 .
>>>      >>>>    }
>>>      >>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
>>>      >>
>>>      >> same as
>>>      >>
>>>      >> FILTER
>>>      >> NOT EXISTS { ?var1 <http://www.wikidata.org/prop/direct/P1196
>>>      <http://www.wikidata.org/prop/direct/P1196>> ?var5 }
>>>      >>
>>>      >> except different cardinality. Because ?var5 isn't used
>>>      eslewhere, is
>>>      >> that waht was meant here?
>>>      >>
>>>      >>>> ) .
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P569
>>>      <http://www.wikidata.org/prop/direct/P569>> ?var6 .
>>>      >>>>    FILTER (  ( (  ?var6  > "1954-12-31T00:00:00Z"^^<
>>>      >>>> http://www.w3.org/2001/XMLSchema#dateTime
>>>      <http://www.w3.org/2001/XMLSchema#dateTime>> ) )
>>>      >>>> ) .
>>>      >>>>    OPTIONAL {
>>>      >>>>     ?var1  <http://wikiba.se/ontology#statements
>>>      <http://wikiba.se/ontology#statements>> ?var3 .
>>>      >>>>    }
>>>      >>>>    OPTIONAL {
>>>      >>>>     ?var1  <http://wikiba.se/ontology#sitelinks
>>>      <http://wikiba.se/ontology#sitelinks>> ?var4 .
>>>      >>>>    }
>>>      >>>> }
>>>      >>>> ORDER BY  DESC( ?var4 ) DESC( ?var2 )ASC( ?var1 )
>>>      >>>>
>>>      >>>>
>>>      >>> 2.
>>>      >>>
>>>      >>>> SELECT ?var1  ?var2
>>>      >>>> WHERE {
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P1889
>>>      <http://www.wikidata.org/prop/direct/P1889>> ?var2 .
>>>      >>>>     MINUS   {
>>>      >>>>       ?var1  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type
>>>      <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>> <
>>>      >>>> http://wikiba.se/ontology#Property
>>>      <http://wikiba.se/ontology#Property>> .
>>>      >>>>     }
>>>      >>>> }
>>>      >>>>
>>>      >>>>
>>>      >>> 3.
>>>      >>>
>>>      >>>> SELECT DISTINCT ?var1  ?var2  ?var3  ?var4
>>>      >>>> WHERE {
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P569
>>>      <http://www.wikidata.org/prop/direct/P569>> ?var3 .
>>>      >>>>    FILTER (  ( (  ?var3  > "1956-01-01T00:00:00Z"^^<
>>>      >>>> http://www.w3.org/2001/XMLSchema#dateTime
>>>      <http://www.w3.org/2001/XMLSchema#dateTime>> ) && (  ?var3  <
>>>      >>>>
>>>      "1957-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime
>>>      <http://www.w3.org/2001/XMLSchema#dateTime>>
>>>      >> ) )
>>>      >>>> ) .
>>>      >>>>    OPTIONAL {
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P570
>>>      <http://www.wikidata.org/prop/direct/P570>> ?var5 .
>>>      >>>>    }
>>>      >>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
>>>      >>>> ) .
>>>      >>>>     ?var2  <http://www.wikidata.org/prop/direct/P569
>>>      <http://www.wikidata.org/prop/direct/P569>> ?var3 .
>>>      >>>>    OPTIONAL {
>>>      >>>>     ?var2  <http://www.wikidata.org/prop/direct/P570
>>>      <http://www.wikidata.org/prop/direct/P570>> ?var6 .
>>>      >>>>    }
>>>      >>>>    FILTER (  ( !( BOUND (  ?var6  ) ) )
>>>      >>>> ) .
>>>      >>>>     ?var1  <http://www.w3.org/2000/01/rdf-schema#label
>>>      <http://www.w3.org/2000/01/rdf-schema#label>> ?var4 .
>>>      >>>>     ?var2  <http://www.w3.org/2000/01/rdf-schema#label
>>>      <http://www.w3.org/2000/01/rdf-schema#label>> ?var4 .
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P31
>>>      <http://www.wikidata.org/prop/direct/P31>> <
>>>      >>>> http://www.wikidata.org/entity/Q5
>>>      <http://www.wikidata.org/entity/Q5>> .
>>>      >>>>     ?var2  <http://www.wikidata.org/prop/direct/P31
>>>      <http://www.wikidata.org/prop/direct/P31>> <
>>>      >>>> http://www.wikidata.org/entity/Q5
>>>      <http://www.wikidata.org/entity/Q5>> .
>>>      >>>>     ?var1  <http://www.wikidata.org/prop/direct/P21
>>>      <http://www.wikidata.org/prop/direct/P21>> ?var7 .
>>>      >>>>     ?var2  <http://www.wikidata.org/prop/direct/P21
>>>      <http://www.wikidata.org/prop/direct/P21>> ?var7 .
>>>      >>>>    FILTER (  ( (  STR (  ?var1  )  <  STR (  ?var2  )  ) )
>>>      >>>> ) .
>>>      >>>>     MINUS   {
>>>      >>>>       ?var1  <http://www.wikidata.org/prop/direct/P7
>>>      <http://www.wikidata.org/prop/direct/P7>> ?var2 .
>>>      >>>>     }
>>>      >>>>     MINUS   {
>>>      >>>>       ?var1  <http://www.wikidata.org/prop/direct/P9
>>>      <http://www.wikidata.org/prop/direct/P9>> ?var2 .
>>>      >>>>     }
>>>      >>>>     MINUS   {
>>>      >>>>       ?var1  <http://www.wikidata.org/prop/direct/P1889
>>>      <http://www.wikidata.org/prop/direct/P1889>> ?var2 .
>>>      >>>>     }
>>>      >>>>     MINUS   {
>>>      >>>>       ?var1  <http://www.wikidata.org/prop/direct/P460
>>>      <http://www.wikidata.org/prop/direct/P460>> ?var2 .
>>>      >>>>     }
>>>      >>>> }
>>>      >>>> LIMIT 500
>>>      >>>
>>>      >>>
>>>      >>>
>>>      >>> The first one is taking 141 minutes,  the second 32 minutes
>>>      and the third
>>>      >>> is still running.
>>>      >>> I have run about 1200 queries, where 38 times the exception
>>>      was thrown,
>>>      >> but
>>>      >>> interestingly not
>>>      >>> in the first two queries, where I got 0 and 575925 results
>>>      respectively.
>>>      >>>
>>>      >>> I'm using jena 4.1.0.
>>>      >>>
>>>      >>>
>>>      >>>
>>>      >>> On Thu, 29 Jul 2021 at 05:27, Andy Seaborne <andy@apache.org
>>>      <ma...@apache.org>> wrote:
>>>      >>>
>>>      >>>> Hi Cristóbal,
>>>      >>>>
>>>      >>>> What's the query and which version of jena is this?
>>>      >>>>
>>>      >>>>        Andy
>>>      >>>>
>>>      >>>> On 28/07/2021 19:39, Cristóbal Miranda wrote:
>>>      >>>>> Hello everyone,
>>>      >>>>>
>>>      >>>>> I'm trying to run a sequence of queries with TDB, using a
>>>      >>>>> locally loaded dataset. I don't want to wait more than a few
>>>      >>>>> seconds for each query to finish. My attempt to do this
>>>      looks like
>>>      >>>>> the following:
>>>      >>>>>
>>>      >>>>>>
>>>      >>>>>> try {
>>>      >>>>>>      RDFConnection rdfConnection =
>>>      >> RDFConnectionFactory.connect(dataset);
>>>      >>>>>>      QueryExecution queryExecution =
>>>      rdfConnection.query(query);
>>>      >>>>>> queryExecution.setTimeout(timeoutMilliseconds);
>>>      >>>>>>      ResultSet resultSet = queryExecution.execSelect();
>>>      >>>>>>      while (resultSet.hasNext()) {
>>>      >>>>>>        QuerySolution querySolution = resultSet.next();
>>>      >>>>>>        ...
>>>      >>>>>>      }
>>>      >>>>>> } catch (QueryCancelledException e) {
>>>      >>>>>>       ...
>>>      >>>>>> }
>>>      >>>>>
>>>      >>>>>
>>>      >>>>> The problem is that this is not working. With htop I see that
>>>      >>>>> the process gets stuck in disk operations. One of the queries
>>>      >>>>> took about 2 hours with the code above. An idea would
>>>      >>>>> be trying to run this in a new thread and stopping the
>>>      thread outside
>>>      >>>> once
>>>      >>>>> the timeout is reached, but I'm almost sure this wouldn't be
>>>      a safe
>>>      >>>>> way to stop the processing, even if it worked.
>>>      >>>>>
>>>      >>>>> Is there a better way to do this?
>>>      >>>>>
>>>      >>>>> Cristobal
>>>      >>>>>
>>>      >>>>
>>>      >>>
>>>      >>
>>>      >
>>>
>>
> 

Re: How to set a timeout for QueryExecution correctly?

Posted by Cristóbal Miranda <cr...@gmail.com>.
Hi Andy,

Q1 seems to have the same problem, I created a script to generate data for
it:
https://github.com/CristobalM/jena-query-timeout/blob/main/scripts/generate_q1_optional.py

It gets stuck in getPlan() when there are no matches for the OPTIONAL
pattern
having ?var5. I tried first with data having that and the timeout worked,
for example,
at 2 seconds.

I'll send you the Visualvm snapshot,


On Mon, 2 Aug 2021 at 06:47, Andy Seaborne <an...@apache.org> wrote:

> Hi there,
>
> Good catch - I can make that happen on my system.
>
> Recorded as JENA-2139
>
> The fix looks quite simple - delay making the index table until first
> use - which works from testing so far. PR soon.
>
> That explains Q2 and Q3 but not Q1 (no MINUS!).
>
> The same work-during-plan building may be happening - not in the group
> (that is a delayed iterator).
>
> If Q1 returns zero results, then the problem isn't in the sorting code
> code itself (that can spend a long time during which it is not reading
> query iterators but sorting zero rows is quite quick) but could be in
> the need to evaluate the pattern. e.g.  FILTER ( ! bound(?var5) ) is
> always false.
>
> Removing parts of the query and seeing what happens may point to the
> culprit.
>
>  > and got the result in the attachment.
>
> Sorry - no attachments on the list. If you still have it, could you send
> it directly to me please?
>
>      Andy
>
> https://issues.apache.org/jira/browse/JENA-2139
> <https://issues.apache.org/jira/browse/JENA-2139>
>
> On 02/08/2021 03:56, Cristóbal Miranda wrote:
> > Hi Andy,
> >
> > Thanks for the clarifications.
> >
> > I have a program to run a query with timeout from a tdb dataset
> > and a script to generate an nt file for the second query. Because it is
> > the simplest among the three, I decided to start with that one. The code
> > is here https://github.com/CristobalM/jena-query-timeout
> > <https://github.com/CristobalM/jena-query-timeout>.
> >
> > I think I have spotted a problem using that java code. I put that into a
> > new module in the jena project and modified QueryExecutionBase with
> > some prints to see that everything is being called. My initial guess
> > to do
> > this was that there is a concurrency problem with the TimeoutCallback
> > and cancel() was never being called on the iterator and from what I saw
> > that seems to be the case.
> >
> > So what happens is that queryIterator = getPlan().iterator()
> > at QueryExecutionBase:558 is taking some time to finish and until then,
> > queryIterator will remain null. Then, if the timeout occurs before that,
> > queryIterator.cancel() won't be called because to do that there is
> > a check that queryIterator is not null (QueryExecutionBase:189).
> >
> > Using `python3 generate_minus_1.py dataset.nt 10000000 9000000`
> > for the dataset (and loaded with tdb.tdbloader2) and setting a timeout
> > of 30 seconds, the iterator is not cancelled, but with 60 seconds it is
> > cancelled.
> >
> > I tried using 120 seconds of timeout for the full wikidata dataset and
> > the Q1
> > won't stop after that time, so I can't be sure if the same is happening
> > there. I will continue with that query and see if it is the same and
> > let you know what happens.
> >
> > Also, I did a sampling with VisualVM, using the Q2 and the dataset
> > mentioned above and got the result in the attachment.
> >
> >
> >
> >
> > On Fri, 30 Jul 2021 at 16:33, Andy Seaborne <andy@apache.org
> > <ma...@apache.org>> wrote:
> >
> >
> >
> >     On 30/07/2021 18:04, Cristóbal Miranda wrote:
> >     >> For Q3, does "LIMIT 1" make any difference?
> >     >
> >     >
> >     > No, I'm trying it now and it is taking several minutes at least.
> >     >
> >     >
> >     >> except different cardinality. Because ?var5 isn't used
> >     eslewhere, is
> >     >> that waht was meant here?
> >     >
> >     >
> >     > I'm also not too sure about the meaning of the queries in
> >     general, I'm just
> >     > using
> >     > a large dataset of anonymized queries. I took one of the
> >     > files from here
> >     https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
> >     <https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en>
> >     > (they say those are from Wikidata's users) and executed a script
> >     to filter
> >     > out some types of queries, the result of that is a file with 27
> >     million
> >     > queries.
> >     > But to answer your question about this specific query, since this
> is
> >     > Wikidata,
> >     > what I'd do is to run the query on https://query.wikidata.org/
> >     <https://query.wikidata.org/>, and infer
> >     > the meaning
> >     > of the variables, ?var1 is some person and P570 date of death,
> >     etc... and
> >     > from the filters it seems
> >     > that they want to find those who died in the last 30 days and
> >     were born
> >     > after 1954. ?var5
> >     > usage seems to indicate that they want to find those whose cause
> >     of death
> >     > is not
> >     > known (because it's an OPTIONAL and !BOUND would filter out
> >     those where
> >     > there is a cause of death).
> >     > In this case it is reasonable that I get no results because the
> >     dataset I'm
> >     > using is older than 30 days.
> >     > So yes, you are right, they are equivalent for this query and
> >     with NOT
> >     > EXISTS would
> >     > probably be faster. But still, what I'd like to have is a way to
> >     stop the
> >     > processing
> >     > after a few seconds consistently, independent of the query.
> >     >
> >     > I just did quick peek at the code and found that, correct me if
> >     I'm wrong,
> >     > the cancellation of
> >     > a query is propagated in a top down manner for the
> >     QueryIterators, in a way
> >     > that to notice
> >     > that the query was aborted you have to get the next result (I'm
> >     looking
> >     > specifically at QueryIterRepeatApply.java:65),
> >     > meanwhile the query might be stuck at some heavy processing in a
> >     lower
> >     > level, which might be the case for
> >     > nested OPTIONALs.
> >
> >     Yes - that the mechanism but nested OPTIONALs are optimized to
> >     "(condition)" which incrementally reads iterators.
> >
> >     In a query plan, ("qparse --print=opt") it shows (condition) not
> >     (leftjoin).
> >
> >     The (minus) operator may be reading all one side and then missing the
> >     timeout but it is still advancing one side (the left)
> >
> >     So trying the FILTER NOT EXISTS version of one of the problem queries
> >     will give us some help in identifying if the problem is MINUS or not.
> >
> >     > In the case of QueryIterMinus I see that in the constructor you
> >     have to
> >     > create the index for the right iterator, but
> >     > that index creation is not stoppable. (On a side note, did you
> >     try doing
> >     > QueryIterMinus similar to QueryIterOptionalIndex,
> >     > using QueryIterRepeatApply? I just have the impression that
> >     running that
> >     > subquery with left values bound and checking
> >     > for only one result could be faster, but this is just a
> >     suggestion in case
> >     > you haven't tried, don't take this
> >     > too seriously).
> >
> >     The right-hand-side index building is once per operator so gets
> >     reuses
> >     across the whole of traversing the left.
> >
> >     IndexFactory.create is reading a query iterator.
> >
> >     So on the surface, query iterators are being read and the time
> >     should be
> >     seen.
> >
> >     Unless it hits the LinearIndex case which looks expensive.
> >
> >     >
> >     > If you consider that those are things to be fixed and don't have
> >     it in high
> >     > priority I could give it a try and send a PR,
> >     > if it's ok with you.
> >
> >     That would be excellent but let's establish the problem first to
> >     know it
> >     is going to improve things.
> >
> >     If you can point VisualVM/YourKit/... at a slow query and identify
> >     the
> >     code hot spots, that would be perfect.
> >
> >     A small example dataset would also be good so we have a minimal,
> >     executable example.
> >
> >     >
> >     > Sorry if I seem too intrusive here, but I'm really interested in
> >     having the
> >     > timeout work well for any query!
> >
> >     Not at all - it is part of the point of open source.
> >
> >          Andy
> >
> >     >
> >     >
> >     >
> >     > On Fri, 30 Jul 2021 at 04:47, Andy Seaborne <andy@apache.org
> >     <ma...@apache.org>> wrote:
> >     >
> >     >> Hi Cristóbal,
> >     >>
> >     >> I can't tell from just the queries.
> >     >>
> >     >> You say the first query has zero results so it is not the ORDER
> BY.
> >     >>
> >     >> The queries themselves may be inefficient and changing the
> >     order may
> >     >> have a significant effect (all data dependent).
> >     >>
> >     >> All three do not use FILTER NOT EXISTS where they could. That
> >     may well
> >     >> be faster.
> >     >>
> >     >> Q1 uses OPTIONAL !BOUND (see inline), Q2 and Q3 use MINUS.
> >     >>
> >     >> For Q3, does "LIMIT 1" make any difference?
> >     >>
> >     >>       Andy
> >     >>
> >     >> On 29/07/2021 15:57, Cristóbal Miranda wrote:
> >     >>> Hi Andy,
> >     >>>
> >     >>> I have three queries in which this happens:
> >     >>>
> >     >>> 1.
> >     >>>
> >     >>> SELECT ?var1  ?var2  ?var3  ?var4
> >     >>>> WHERE {
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P570
> >     <http://www.wikidata.org/prop/direct/P570>> ?var2 .
> >     >>>>    FILTER (  ( (  ?var2  > "2016-07-30T00:00:00Z"^^<
> >     >>>> http://www.w3.org/2001/XMLSchema#dateTime
> >     <http://www.w3.org/2001/XMLSchema#dateTime>> ) )
> >     >>>> ) .
> >     >>>>    FILTER (  ( (  ?var2  >  (  NOW (  ) - "P32D"^^<
> >     >>>> http://www.w3.org/2001/XMLSchema#duration
> >     <http://www.w3.org/2001/XMLSchema#duration>> )  ) && (  ?var2  <
> NOW
> >     >> (  )
> >     >>>>    ) )
> >     >>>> ) .
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P31
> >     <http://www.wikidata.org/prop/direct/P31>> <
> >     >>>> http://www.wikidata.org/entity/Q5
> >     <http://www.wikidata.org/entity/Q5>> .
> >     >>
> >     >> ---------
> >     >>>>    OPTIONAL {
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P1196
> >     <http://www.wikidata.org/prop/direct/P1196>> ?var5 .
> >     >>>>    }
> >     >>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
> >     >>
> >     >> same as
> >     >>
> >     >> FILTER
> >     >> NOT EXISTS { ?var1 <http://www.wikidata.org/prop/direct/P1196
> >     <http://www.wikidata.org/prop/direct/P1196>> ?var5 }
> >     >>
> >     >> except different cardinality. Because ?var5 isn't used
> >     eslewhere, is
> >     >> that waht was meant here?
> >     >>
> >     >>>> ) .
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P569
> >     <http://www.wikidata.org/prop/direct/P569>> ?var6 .
> >     >>>>    FILTER (  ( (  ?var6  > "1954-12-31T00:00:00Z"^^<
> >     >>>> http://www.w3.org/2001/XMLSchema#dateTime
> >     <http://www.w3.org/2001/XMLSchema#dateTime>> ) )
> >     >>>> ) .
> >     >>>>    OPTIONAL {
> >     >>>>     ?var1  <http://wikiba.se/ontology#statements
> >     <http://wikiba.se/ontology#statements>> ?var3 .
> >     >>>>    }
> >     >>>>    OPTIONAL {
> >     >>>>     ?var1  <http://wikiba.se/ontology#sitelinks
> >     <http://wikiba.se/ontology#sitelinks>> ?var4 .
> >     >>>>    }
> >     >>>> }
> >     >>>> ORDER BY  DESC( ?var4 ) DESC( ?var2 )ASC( ?var1 )
> >     >>>>
> >     >>>>
> >     >>> 2.
> >     >>>
> >     >>>> SELECT ?var1  ?var2
> >     >>>> WHERE {
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P1889
> >     <http://www.wikidata.org/prop/direct/P1889>> ?var2 .
> >     >>>>     MINUS   {
> >     >>>>       ?var1  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type
> >     <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>> <
> >     >>>> http://wikiba.se/ontology#Property
> >     <http://wikiba.se/ontology#Property>> .
> >     >>>>     }
> >     >>>> }
> >     >>>>
> >     >>>>
> >     >>> 3.
> >     >>>
> >     >>>> SELECT DISTINCT ?var1  ?var2  ?var3  ?var4
> >     >>>> WHERE {
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P569
> >     <http://www.wikidata.org/prop/direct/P569>> ?var3 .
> >     >>>>    FILTER (  ( (  ?var3  > "1956-01-01T00:00:00Z"^^<
> >     >>>> http://www.w3.org/2001/XMLSchema#dateTime
> >     <http://www.w3.org/2001/XMLSchema#dateTime>> ) && (  ?var3  <
> >     >>>>
> >     "1957-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime
> >     <http://www.w3.org/2001/XMLSchema#dateTime>>
> >     >> ) )
> >     >>>> ) .
> >     >>>>    OPTIONAL {
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P570
> >     <http://www.wikidata.org/prop/direct/P570>> ?var5 .
> >     >>>>    }
> >     >>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
> >     >>>> ) .
> >     >>>>     ?var2  <http://www.wikidata.org/prop/direct/P569
> >     <http://www.wikidata.org/prop/direct/P569>> ?var3 .
> >     >>>>    OPTIONAL {
> >     >>>>     ?var2  <http://www.wikidata.org/prop/direct/P570
> >     <http://www.wikidata.org/prop/direct/P570>> ?var6 .
> >     >>>>    }
> >     >>>>    FILTER (  ( !( BOUND (  ?var6  ) ) )
> >     >>>> ) .
> >     >>>>     ?var1  <http://www.w3.org/2000/01/rdf-schema#label
> >     <http://www.w3.org/2000/01/rdf-schema#label>> ?var4 .
> >     >>>>     ?var2  <http://www.w3.org/2000/01/rdf-schema#label
> >     <http://www.w3.org/2000/01/rdf-schema#label>> ?var4 .
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P31
> >     <http://www.wikidata.org/prop/direct/P31>> <
> >     >>>> http://www.wikidata.org/entity/Q5
> >     <http://www.wikidata.org/entity/Q5>> .
> >     >>>>     ?var2  <http://www.wikidata.org/prop/direct/P31
> >     <http://www.wikidata.org/prop/direct/P31>> <
> >     >>>> http://www.wikidata.org/entity/Q5
> >     <http://www.wikidata.org/entity/Q5>> .
> >     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P21
> >     <http://www.wikidata.org/prop/direct/P21>> ?var7 .
> >     >>>>     ?var2  <http://www.wikidata.org/prop/direct/P21
> >     <http://www.wikidata.org/prop/direct/P21>> ?var7 .
> >     >>>>    FILTER (  ( (  STR (  ?var1  )  <  STR (  ?var2  )  ) )
> >     >>>> ) .
> >     >>>>     MINUS   {
> >     >>>>       ?var1  <http://www.wikidata.org/prop/direct/P7
> >     <http://www.wikidata.org/prop/direct/P7>> ?var2 .
> >     >>>>     }
> >     >>>>     MINUS   {
> >     >>>>       ?var1  <http://www.wikidata.org/prop/direct/P9
> >     <http://www.wikidata.org/prop/direct/P9>> ?var2 .
> >     >>>>     }
> >     >>>>     MINUS   {
> >     >>>>       ?var1  <http://www.wikidata.org/prop/direct/P1889
> >     <http://www.wikidata.org/prop/direct/P1889>> ?var2 .
> >     >>>>     }
> >     >>>>     MINUS   {
> >     >>>>       ?var1  <http://www.wikidata.org/prop/direct/P460
> >     <http://www.wikidata.org/prop/direct/P460>> ?var2 .
> >     >>>>     }
> >     >>>> }
> >     >>>> LIMIT 500
> >     >>>
> >     >>>
> >     >>>
> >     >>> The first one is taking 141 minutes,  the second 32 minutes
> >     and the third
> >     >>> is still running.
> >     >>> I have run about 1200 queries, where 38 times the exception
> >     was thrown,
> >     >> but
> >     >>> interestingly not
> >     >>> in the first two queries, where I got 0 and 575925 results
> >     respectively.
> >     >>>
> >     >>> I'm using jena 4.1.0.
> >     >>>
> >     >>>
> >     >>>
> >     >>> On Thu, 29 Jul 2021 at 05:27, Andy Seaborne <andy@apache.org
> >     <ma...@apache.org>> wrote:
> >     >>>
> >     >>>> Hi Cristóbal,
> >     >>>>
> >     >>>> What's the query and which version of jena is this?
> >     >>>>
> >     >>>>        Andy
> >     >>>>
> >     >>>> On 28/07/2021 19:39, Cristóbal Miranda wrote:
> >     >>>>> Hello everyone,
> >     >>>>>
> >     >>>>> I'm trying to run a sequence of queries with TDB, using a
> >     >>>>> locally loaded dataset. I don't want to wait more than a few
> >     >>>>> seconds for each query to finish. My attempt to do this
> >     looks like
> >     >>>>> the following:
> >     >>>>>
> >     >>>>>>
> >     >>>>>> try {
> >     >>>>>>      RDFConnection rdfConnection =
> >     >> RDFConnectionFactory.connect(dataset);
> >     >>>>>>      QueryExecution queryExecution =
> >     rdfConnection.query(query);
> >     >>>>>> queryExecution.setTimeout(timeoutMilliseconds);
> >     >>>>>>      ResultSet resultSet = queryExecution.execSelect();
> >     >>>>>>      while (resultSet.hasNext()) {
> >     >>>>>>        QuerySolution querySolution = resultSet.next();
> >     >>>>>>        ...
> >     >>>>>>      }
> >     >>>>>> } catch (QueryCancelledException e) {
> >     >>>>>>       ...
> >     >>>>>> }
> >     >>>>>
> >     >>>>>
> >     >>>>> The problem is that this is not working. With htop I see that
> >     >>>>> the process gets stuck in disk operations. One of the queries
> >     >>>>> took about 2 hours with the code above. An idea would
> >     >>>>> be trying to run this in a new thread and stopping the
> >     thread outside
> >     >>>> once
> >     >>>>> the timeout is reached, but I'm almost sure this wouldn't be
> >     a safe
> >     >>>>> way to stop the processing, even if it worked.
> >     >>>>>
> >     >>>>> Is there a better way to do this?
> >     >>>>>
> >     >>>>> Cristobal
> >     >>>>>
> >     >>>>
> >     >>>
> >     >>
> >     >
> >
>

Re: How to set a timeout for QueryExecution correctly?

Posted by Andy Seaborne <an...@apache.org>.
Hi there,

Good catch - I can make that happen on my system.

Recorded as JENA-2139

The fix looks quite simple - delay making the index table until first 
use - which works from testing so far. PR soon.

That explains Q2 and Q3 but not Q1 (no MINUS!).

The same work-during-plan building may be happening - not in the group 
(that is a delayed iterator).

If Q1 returns zero results, then the problem isn't in the sorting code 
code itself (that can spend a long time during which it is not reading 
query iterators but sorting zero rows is quite quick) but could be in 
the need to evaluate the pattern. e.g.  FILTER ( ! bound(?var5) ) is 
always false.

Removing parts of the query and seeing what happens may point to the 
culprit.

 > and got the result in the attachment.

Sorry - no attachments on the list. If you still have it, could you send 
it directly to me please?

     Andy

https://issues.apache.org/jira/browse/JENA-2139 
<https://issues.apache.org/jira/browse/JENA-2139>

On 02/08/2021 03:56, Cristóbal Miranda wrote:
> Hi Andy,
>
> Thanks for the clarifications.
>
> I have a program to run a query with timeout from a tdb dataset
> and a script to generate an nt file for the second query. Because it is
> the simplest among the three, I decided to start with that one. The code
> is here https://github.com/CristobalM/jena-query-timeout 
> <https://github.com/CristobalM/jena-query-timeout>.
>
> I think I have spotted a problem using that java code. I put that into a
> new module in the jena project and modified QueryExecutionBase with
> some prints to see that everything is being called. My initial guess 
> to do
> this was that there is a concurrency problem with the TimeoutCallback
> and cancel() was never being called on the iterator and from what I saw
> that seems to be the case.
>
> So what happens is that queryIterator = getPlan().iterator()
> at QueryExecutionBase:558 is taking some time to finish and until then,
> queryIterator will remain null. Then, if the timeout occurs before that,
> queryIterator.cancel() won't be called because to do that there is
> a check that queryIterator is not null (QueryExecutionBase:189).
>
> Using `python3 generate_minus_1.py dataset.nt 10000000 9000000`
> for the dataset (and loaded with tdb.tdbloader2) and setting a timeout
> of 30 seconds, the iterator is not cancelled, but with 60 seconds it is
> cancelled.
>
> I tried using 120 seconds of timeout for the full wikidata dataset and 
> the Q1
> won't stop after that time, so I can't be sure if the same is happening
> there. I will continue with that query and see if it is the same and
> let you know what happens.
>
> Also, I did a sampling with VisualVM, using the Q2 and the dataset
> mentioned above and got the result in the attachment.
>
>
>
>
> On Fri, 30 Jul 2021 at 16:33, Andy Seaborne <andy@apache.org 
> <ma...@apache.org>> wrote:
>
>
>
>     On 30/07/2021 18:04, Cristóbal Miranda wrote:
>     >> For Q3, does "LIMIT 1" make any difference?
>     >
>     >
>     > No, I'm trying it now and it is taking several minutes at least.
>     >
>     >
>     >> except different cardinality. Because ?var5 isn't used
>     eslewhere, is
>     >> that waht was meant here?
>     >
>     >
>     > I'm also not too sure about the meaning of the queries in
>     general, I'm just
>     > using
>     > a large dataset of anonymized queries. I took one of the
>     > files from here
>     https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
>     <https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en>
>     > (they say those are from Wikidata's users) and executed a script
>     to filter
>     > out some types of queries, the result of that is a file with 27
>     million
>     > queries.
>     > But to answer your question about this specific query, since this is
>     > Wikidata,
>     > what I'd do is to run the query on https://query.wikidata.org/
>     <https://query.wikidata.org/>, and infer
>     > the meaning
>     > of the variables, ?var1 is some person and P570 date of death,
>     etc... and
>     > from the filters it seems
>     > that they want to find those who died in the last 30 days and
>     were born
>     > after 1954. ?var5
>     > usage seems to indicate that they want to find those whose cause
>     of death
>     > is not
>     > known (because it's an OPTIONAL and !BOUND would filter out
>     those where
>     > there is a cause of death).
>     > In this case it is reasonable that I get no results because the
>     dataset I'm
>     > using is older than 30 days.
>     > So yes, you are right, they are equivalent for this query and
>     with NOT
>     > EXISTS would
>     > probably be faster. But still, what I'd like to have is a way to
>     stop the
>     > processing
>     > after a few seconds consistently, independent of the query.
>     >
>     > I just did quick peek at the code and found that, correct me if
>     I'm wrong,
>     > the cancellation of
>     > a query is propagated in a top down manner for the
>     QueryIterators, in a way
>     > that to notice
>     > that the query was aborted you have to get the next result (I'm
>     looking
>     > specifically at QueryIterRepeatApply.java:65),
>     > meanwhile the query might be stuck at some heavy processing in a
>     lower
>     > level, which might be the case for
>     > nested OPTIONALs.
>
>     Yes - that the mechanism but nested OPTIONALs are optimized to
>     "(condition)" which incrementally reads iterators.
>
>     In a query plan, ("qparse --print=opt") it shows (condition) not
>     (leftjoin).
>
>     The (minus) operator may be reading all one side and then missing the
>     timeout but it is still advancing one side (the left)
>
>     So trying the FILTER NOT EXISTS version of one of the problem queries
>     will give us some help in identifying if the problem is MINUS or not.
>
>     > In the case of QueryIterMinus I see that in the constructor you
>     have to
>     > create the index for the right iterator, but
>     > that index creation is not stoppable. (On a side note, did you
>     try doing
>     > QueryIterMinus similar to QueryIterOptionalIndex,
>     > using QueryIterRepeatApply? I just have the impression that
>     running that
>     > subquery with left values bound and checking
>     > for only one result could be faster, but this is just a
>     suggestion in case
>     > you haven't tried, don't take this
>     > too seriously).
>
>     The right-hand-side index building is once per operator so gets
>     reuses
>     across the whole of traversing the left.
>
>     IndexFactory.create is reading a query iterator.
>
>     So on the surface, query iterators are being read and the time
>     should be
>     seen.
>
>     Unless it hits the LinearIndex case which looks expensive.
>
>     >
>     > If you consider that those are things to be fixed and don't have
>     it in high
>     > priority I could give it a try and send a PR,
>     > if it's ok with you.
>
>     That would be excellent but let's establish the problem first to
>     know it
>     is going to improve things.
>
>     If you can point VisualVM/YourKit/... at a slow query and identify
>     the
>     code hot spots, that would be perfect.
>
>     A small example dataset would also be good so we have a minimal,
>     executable example.
>
>     >
>     > Sorry if I seem too intrusive here, but I'm really interested in
>     having the
>     > timeout work well for any query!
>
>     Not at all - it is part of the point of open source.
>
>          Andy
>
>     >
>     >
>     >
>     > On Fri, 30 Jul 2021 at 04:47, Andy Seaborne <andy@apache.org
>     <ma...@apache.org>> wrote:
>     >
>     >> Hi Cristóbal,
>     >>
>     >> I can't tell from just the queries.
>     >>
>     >> You say the first query has zero results so it is not the ORDER BY.
>     >>
>     >> The queries themselves may be inefficient and changing the
>     order may
>     >> have a significant effect (all data dependent).
>     >>
>     >> All three do not use FILTER NOT EXISTS where they could. That
>     may well
>     >> be faster.
>     >>
>     >> Q1 uses OPTIONAL !BOUND (see inline), Q2 and Q3 use MINUS.
>     >>
>     >> For Q3, does "LIMIT 1" make any difference?
>     >>
>     >>       Andy
>     >>
>     >> On 29/07/2021 15:57, Cristóbal Miranda wrote:
>     >>> Hi Andy,
>     >>>
>     >>> I have three queries in which this happens:
>     >>>
>     >>> 1.
>     >>>
>     >>> SELECT ?var1  ?var2  ?var3  ?var4
>     >>>> WHERE {
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P570
>     <http://www.wikidata.org/prop/direct/P570>> ?var2 .
>     >>>>    FILTER (  ( (  ?var2  > "2016-07-30T00:00:00Z"^^<
>     >>>> http://www.w3.org/2001/XMLSchema#dateTime
>     <http://www.w3.org/2001/XMLSchema#dateTime>> ) )
>     >>>> ) .
>     >>>>    FILTER (  ( (  ?var2  >  (  NOW (  ) - "P32D"^^<
>     >>>> http://www.w3.org/2001/XMLSchema#duration
>     <http://www.w3.org/2001/XMLSchema#duration>> )  ) && (  ?var2  <  NOW
>     >> (  )
>     >>>>    ) )
>     >>>> ) .
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P31
>     <http://www.wikidata.org/prop/direct/P31>> <
>     >>>> http://www.wikidata.org/entity/Q5
>     <http://www.wikidata.org/entity/Q5>> .
>     >>
>     >> ---------
>     >>>>    OPTIONAL {
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P1196
>     <http://www.wikidata.org/prop/direct/P1196>> ?var5 .
>     >>>>    }
>     >>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
>     >>
>     >> same as
>     >>
>     >> FILTER
>     >> NOT EXISTS { ?var1 <http://www.wikidata.org/prop/direct/P1196
>     <http://www.wikidata.org/prop/direct/P1196>> ?var5 }
>     >>
>     >> except different cardinality. Because ?var5 isn't used
>     eslewhere, is
>     >> that waht was meant here?
>     >>
>     >>>> ) .
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P569
>     <http://www.wikidata.org/prop/direct/P569>> ?var6 .
>     >>>>    FILTER (  ( (  ?var6  > "1954-12-31T00:00:00Z"^^<
>     >>>> http://www.w3.org/2001/XMLSchema#dateTime
>     <http://www.w3.org/2001/XMLSchema#dateTime>> ) )
>     >>>> ) .
>     >>>>    OPTIONAL {
>     >>>>     ?var1  <http://wikiba.se/ontology#statements
>     <http://wikiba.se/ontology#statements>> ?var3 .
>     >>>>    }
>     >>>>    OPTIONAL {
>     >>>>     ?var1  <http://wikiba.se/ontology#sitelinks
>     <http://wikiba.se/ontology#sitelinks>> ?var4 .
>     >>>>    }
>     >>>> }
>     >>>> ORDER BY  DESC( ?var4 ) DESC( ?var2 )ASC( ?var1 )
>     >>>>
>     >>>>
>     >>> 2.
>     >>>
>     >>>> SELECT ?var1  ?var2
>     >>>> WHERE {
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P1889
>     <http://www.wikidata.org/prop/direct/P1889>> ?var2 .
>     >>>>     MINUS   {
>     >>>>       ?var1  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type
>     <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>> <
>     >>>> http://wikiba.se/ontology#Property
>     <http://wikiba.se/ontology#Property>> .
>     >>>>     }
>     >>>> }
>     >>>>
>     >>>>
>     >>> 3.
>     >>>
>     >>>> SELECT DISTINCT ?var1  ?var2  ?var3  ?var4
>     >>>> WHERE {
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P569
>     <http://www.wikidata.org/prop/direct/P569>> ?var3 .
>     >>>>    FILTER (  ( (  ?var3  > "1956-01-01T00:00:00Z"^^<
>     >>>> http://www.w3.org/2001/XMLSchema#dateTime
>     <http://www.w3.org/2001/XMLSchema#dateTime>> ) && (  ?var3  <
>     >>>>   
>     "1957-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime
>     <http://www.w3.org/2001/XMLSchema#dateTime>>
>     >> ) )
>     >>>> ) .
>     >>>>    OPTIONAL {
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P570
>     <http://www.wikidata.org/prop/direct/P570>> ?var5 .
>     >>>>    }
>     >>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
>     >>>> ) .
>     >>>>     ?var2  <http://www.wikidata.org/prop/direct/P569
>     <http://www.wikidata.org/prop/direct/P569>> ?var3 .
>     >>>>    OPTIONAL {
>     >>>>     ?var2  <http://www.wikidata.org/prop/direct/P570
>     <http://www.wikidata.org/prop/direct/P570>> ?var6 .
>     >>>>    }
>     >>>>    FILTER (  ( !( BOUND (  ?var6  ) ) )
>     >>>> ) .
>     >>>>     ?var1  <http://www.w3.org/2000/01/rdf-schema#label
>     <http://www.w3.org/2000/01/rdf-schema#label>> ?var4 .
>     >>>>     ?var2  <http://www.w3.org/2000/01/rdf-schema#label
>     <http://www.w3.org/2000/01/rdf-schema#label>> ?var4 .
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P31
>     <http://www.wikidata.org/prop/direct/P31>> <
>     >>>> http://www.wikidata.org/entity/Q5
>     <http://www.wikidata.org/entity/Q5>> .
>     >>>>     ?var2  <http://www.wikidata.org/prop/direct/P31
>     <http://www.wikidata.org/prop/direct/P31>> <
>     >>>> http://www.wikidata.org/entity/Q5
>     <http://www.wikidata.org/entity/Q5>> .
>     >>>>     ?var1  <http://www.wikidata.org/prop/direct/P21
>     <http://www.wikidata.org/prop/direct/P21>> ?var7 .
>     >>>>     ?var2  <http://www.wikidata.org/prop/direct/P21
>     <http://www.wikidata.org/prop/direct/P21>> ?var7 .
>     >>>>    FILTER (  ( (  STR (  ?var1  )  <  STR (  ?var2  )  ) )
>     >>>> ) .
>     >>>>     MINUS   {
>     >>>>       ?var1  <http://www.wikidata.org/prop/direct/P7
>     <http://www.wikidata.org/prop/direct/P7>> ?var2 .
>     >>>>     }
>     >>>>     MINUS   {
>     >>>>       ?var1  <http://www.wikidata.org/prop/direct/P9
>     <http://www.wikidata.org/prop/direct/P9>> ?var2 .
>     >>>>     }
>     >>>>     MINUS   {
>     >>>>       ?var1  <http://www.wikidata.org/prop/direct/P1889
>     <http://www.wikidata.org/prop/direct/P1889>> ?var2 .
>     >>>>     }
>     >>>>     MINUS   {
>     >>>>       ?var1  <http://www.wikidata.org/prop/direct/P460
>     <http://www.wikidata.org/prop/direct/P460>> ?var2 .
>     >>>>     }
>     >>>> }
>     >>>> LIMIT 500
>     >>>
>     >>>
>     >>>
>     >>> The first one is taking 141 minutes,  the second 32 minutes
>     and the third
>     >>> is still running.
>     >>> I have run about 1200 queries, where 38 times the exception
>     was thrown,
>     >> but
>     >>> interestingly not
>     >>> in the first two queries, where I got 0 and 575925 results
>     respectively.
>     >>>
>     >>> I'm using jena 4.1.0.
>     >>>
>     >>>
>     >>>
>     >>> On Thu, 29 Jul 2021 at 05:27, Andy Seaborne <andy@apache.org
>     <ma...@apache.org>> wrote:
>     >>>
>     >>>> Hi Cristóbal,
>     >>>>
>     >>>> What's the query and which version of jena is this?
>     >>>>
>     >>>>        Andy
>     >>>>
>     >>>> On 28/07/2021 19:39, Cristóbal Miranda wrote:
>     >>>>> Hello everyone,
>     >>>>>
>     >>>>> I'm trying to run a sequence of queries with TDB, using a
>     >>>>> locally loaded dataset. I don't want to wait more than a few
>     >>>>> seconds for each query to finish. My attempt to do this
>     looks like
>     >>>>> the following:
>     >>>>>
>     >>>>>>
>     >>>>>> try {
>     >>>>>>      RDFConnection rdfConnection =
>     >> RDFConnectionFactory.connect(dataset);
>     >>>>>>      QueryExecution queryExecution =
>     rdfConnection.query(query);
>     >>>>>> queryExecution.setTimeout(timeoutMilliseconds);
>     >>>>>>      ResultSet resultSet = queryExecution.execSelect();
>     >>>>>>      while (resultSet.hasNext()) {
>     >>>>>>        QuerySolution querySolution = resultSet.next();
>     >>>>>>        ...
>     >>>>>>      }
>     >>>>>> } catch (QueryCancelledException e) {
>     >>>>>>       ...
>     >>>>>> }
>     >>>>>
>     >>>>>
>     >>>>> The problem is that this is not working. With htop I see that
>     >>>>> the process gets stuck in disk operations. One of the queries
>     >>>>> took about 2 hours with the code above. An idea would
>     >>>>> be trying to run this in a new thread and stopping the
>     thread outside
>     >>>> once
>     >>>>> the timeout is reached, but I'm almost sure this wouldn't be
>     a safe
>     >>>>> way to stop the processing, even if it worked.
>     >>>>>
>     >>>>> Is there a better way to do this?
>     >>>>>
>     >>>>> Cristobal
>     >>>>>
>     >>>>
>     >>>
>     >>
>     >
>

Re: How to set a timeout for QueryExecution correctly?

Posted by Cristóbal Miranda <cr...@gmail.com>.
Hi Andy,

Thanks for the clarifications.

I have a program to run a query with timeout from a tdb dataset
and a script to generate an nt file for the second query. Because it is
the simplest among the three, I decided to start with that one. The code
is here https://github.com/CristobalM/jena-query-timeout.

I think I have spotted a problem using that java code. I put that into a
new module in the jena project and modified QueryExecutionBase with
some prints to see that everything is being called. My initial guess to do
this was that there is a concurrency problem with the TimeoutCallback
and cancel() was never being called on the iterator and from what I saw
that seems to be the case.

So what happens is that queryIterator = getPlan().iterator()
at QueryExecutionBase:558 is taking some time to finish and until then,
queryIterator will remain null. Then, if the timeout occurs before that,
queryIterator.cancel() won't be called because to do that there is
a check that queryIterator is not null (QueryExecutionBase:189).

Using `python3 generate_minus_1.py dataset.nt 10000000 9000000`
for the dataset (and loaded with tdb.tdbloader2) and setting a timeout
of 30 seconds, the iterator is not cancelled, but with 60 seconds it is
cancelled.

I tried using 120 seconds of timeout for the full wikidata dataset and the
Q1
won't stop after that time, so I can't be sure if the same is happening
there. I will continue with that query and see if it is the same and
let you know what happens.

Also, I did a sampling with VisualVM, using the Q2 and the dataset
mentioned above and got the result in the attachment.




On Fri, 30 Jul 2021 at 16:33, Andy Seaborne <an...@apache.org> wrote:

>
>
> On 30/07/2021 18:04, Cristóbal Miranda wrote:
> >> For Q3, does "LIMIT 1" make any difference?
> >
> >
> > No, I'm trying it now and it is taking several minutes at least.
> >
> >
> >> except different cardinality. Because ?var5 isn't used eslewhere, is
> >> that waht was meant here?
> >
> >
> > I'm also not too sure about the meaning of the queries in general, I'm
> just
> > using
> > a large dataset of anonymized queries. I took one of the
> > files from here
> https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
> > (they say those are from Wikidata's users) and executed a script to
> filter
> > out some types of queries, the result of that is a file with 27 million
> > queries.
> > But to answer your question about this specific query, since this is
> > Wikidata,
> > what I'd do is to run the query on https://query.wikidata.org/, and
> infer
> > the meaning
> > of the variables, ?var1 is some person and P570 date of death, etc... and
> > from the filters it seems
> > that they want to find those who died in the last 30 days and were born
> > after 1954. ?var5
> > usage seems to indicate that they want to find those whose cause of death
> > is not
> > known (because it's an OPTIONAL and !BOUND would filter out those where
> > there is a cause of death).
> > In this case it is reasonable that I get no results because the dataset
> I'm
> > using is older than 30 days.
> > So yes, you are right, they are equivalent for this query and with NOT
> > EXISTS would
> > probably be faster. But still, what I'd like to have is a way to stop the
> > processing
> > after a few seconds consistently, independent of the query.
> >
> > I just did quick peek at the code and found that, correct me if I'm
> wrong,
> > the cancellation of
> > a query is propagated in a top down manner for the QueryIterators, in a
> way
> > that to notice
> > that the query was aborted you have to get the next result (I'm looking
> > specifically at QueryIterRepeatApply.java:65),
> > meanwhile the query might be stuck at some heavy processing in a lower
> > level, which might be the case for
> > nested OPTIONALs.
>
> Yes - that the mechanism but nested OPTIONALs are optimized to
> "(condition)" which incrementally reads iterators.
>
> In a query plan, ("qparse --print=opt") it shows (condition) not
> (leftjoin).
>
> The (minus) operator may be reading all one side and then missing the
> timeout but it is still advancing one side (the left)
>
> So trying the FILTER NOT EXISTS version of one of the problem queries
> will give us some help in identifying if the problem is MINUS or not.
>
> > In the case of QueryIterMinus I see that in the constructor you have to
> > create the index for the right iterator, but
> > that index creation is not stoppable. (On a side note, did you try doing
> > QueryIterMinus similar to QueryIterOptionalIndex,
> > using QueryIterRepeatApply? I just have the impression that running that
> > subquery with left values bound and checking
> > for only one result could be faster, but this is just a suggestion in
> case
> > you haven't tried, don't take this
> > too seriously).
>
> The right-hand-side index building is once per operator so gets reuses
> across the whole of traversing the left.
>
> IndexFactory.create is reading a query iterator.
>
> So on the surface, query iterators are being read and the time should be
> seen.
>
> Unless it hits the LinearIndex case which looks expensive.
>
> >
> > If you consider that those are things to be fixed and don't have it in
> high
> > priority I could give it a try and send a PR,
> > if it's ok with you.
>
> That would be excellent but let's establish the problem first to know it
> is going to improve things.
>
> If you can point VisualVM/YourKit/... at a slow query and identify the
> code hot spots, that would be perfect.
>
> A small example dataset would also be good so we have a minimal,
> executable example.
>
> >
> > Sorry if I seem too intrusive here, but I'm really interested in having
> the
> > timeout work well for any query!
>
> Not at all - it is part of the point of open source.
>
>      Andy
>
> >
> >
> >
> > On Fri, 30 Jul 2021 at 04:47, Andy Seaborne <an...@apache.org> wrote:
> >
> >> Hi Cristóbal,
> >>
> >> I can't tell from just the queries.
> >>
> >> You say the first query has zero results so it is not the ORDER BY.
> >>
> >> The queries themselves may be inefficient and changing the order may
> >> have a significant effect (all data dependent).
> >>
> >> All three do not use FILTER NOT EXISTS where they could. That may well
> >> be faster.
> >>
> >> Q1 uses OPTIONAL !BOUND (see inline), Q2 and Q3 use MINUS.
> >>
> >> For Q3, does "LIMIT 1" make any difference?
> >>
> >>       Andy
> >>
> >> On 29/07/2021 15:57, Cristóbal Miranda wrote:
> >>> Hi Andy,
> >>>
> >>> I have three queries in which this happens:
> >>>
> >>> 1.
> >>>
> >>> SELECT ?var1  ?var2  ?var3  ?var4
> >>>> WHERE {
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var2 .
> >>>>    FILTER (  ( (  ?var2  >  "2016-07-30T00:00:00Z"^^<
> >>>> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
> >>>> ) .
> >>>>    FILTER (  ( (  ?var2  >  (  NOW (  ) - "P32D"^^<
> >>>> http://www.w3.org/2001/XMLSchema#duration>  )  ) && (  ?var2  <  NOW
> >> (  )
> >>>>    ) )
> >>>> ) .
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P31>  <
> >>>> http://www.wikidata.org/entity/Q5> .
> >>
> >> ---------
> >>>>    OPTIONAL {
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P1196>  ?var5 .
> >>>>    }
> >>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
> >>
> >> same as
> >>
> >> FILTER
> >> NOT EXISTS { ?var1 <http://www.wikidata.org/prop/direct/P1196> ?var5 }
> >>
> >> except different cardinality. Because ?var5 isn't used eslewhere, is
> >> that waht was meant here?
> >>
> >>>> ) .
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var6 .
> >>>>    FILTER (  ( (  ?var6  >  "1954-12-31T00:00:00Z"^^<
> >>>> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
> >>>> ) .
> >>>>    OPTIONAL {
> >>>>     ?var1  <http://wikiba.se/ontology#statements>  ?var3 .
> >>>>    }
> >>>>    OPTIONAL {
> >>>>     ?var1  <http://wikiba.se/ontology#sitelinks>  ?var4 .
> >>>>    }
> >>>> }
> >>>> ORDER BY  DESC( ?var4 ) DESC( ?var2 )ASC( ?var1 )
> >>>>
> >>>>
> >>> 2.
> >>>
> >>>> SELECT ?var1  ?var2
> >>>> WHERE {
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
> >>>>     MINUS   {
> >>>>       ?var1  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <
> >>>> http://wikiba.se/ontology#Property> .
> >>>>     }
> >>>> }
> >>>>
> >>>>
> >>> 3.
> >>>
> >>>> SELECT DISTINCT ?var1  ?var2  ?var3  ?var4
> >>>> WHERE {
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
> >>>>    FILTER (  ( (  ?var3  >  "1956-01-01T00:00:00Z"^^<
> >>>> http://www.w3.org/2001/XMLSchema#dateTime>  ) && (  ?var3  <
> >>>>    "1957-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime
> >
> >> ) )
> >>>> ) .
> >>>>    OPTIONAL {
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var5 .
> >>>>    }
> >>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
> >>>> ) .
> >>>>     ?var2  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
> >>>>    OPTIONAL {
> >>>>     ?var2  <http://www.wikidata.org/prop/direct/P570>  ?var6 .
> >>>>    }
> >>>>    FILTER (  ( !( BOUND (  ?var6  ) ) )
> >>>> ) .
> >>>>     ?var1  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
> >>>>     ?var2  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P31>  <
> >>>> http://www.wikidata.org/entity/Q5> .
> >>>>     ?var2  <http://www.wikidata.org/prop/direct/P31>  <
> >>>> http://www.wikidata.org/entity/Q5> .
> >>>>     ?var1  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
> >>>>     ?var2  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
> >>>>    FILTER (  ( (  STR (  ?var1  )  <  STR (  ?var2  )  ) )
> >>>> ) .
> >>>>     MINUS   {
> >>>>       ?var1  <http://www.wikidata.org/prop/direct/P7>  ?var2 .
> >>>>     }
> >>>>     MINUS   {
> >>>>       ?var1  <http://www.wikidata.org/prop/direct/P9>  ?var2 .
> >>>>     }
> >>>>     MINUS   {
> >>>>       ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
> >>>>     }
> >>>>     MINUS   {
> >>>>       ?var1  <http://www.wikidata.org/prop/direct/P460>  ?var2 .
> >>>>     }
> >>>> }
> >>>> LIMIT 500
> >>>
> >>>
> >>>
> >>> The first one is taking 141 minutes,  the second 32 minutes and the
> third
> >>> is still running.
> >>> I have run about 1200 queries, where 38 times the exception was thrown,
> >> but
> >>> interestingly not
> >>> in the first two queries, where I got 0 and 575925 results
> respectively.
> >>>
> >>> I'm using jena 4.1.0.
> >>>
> >>>
> >>>
> >>> On Thu, 29 Jul 2021 at 05:27, Andy Seaborne <an...@apache.org> wrote:
> >>>
> >>>> Hi Cristóbal,
> >>>>
> >>>> What's the query and which version of jena is this?
> >>>>
> >>>>        Andy
> >>>>
> >>>> On 28/07/2021 19:39, Cristóbal Miranda wrote:
> >>>>> Hello everyone,
> >>>>>
> >>>>> I'm trying to run a sequence of queries with TDB, using a
> >>>>> locally loaded dataset. I don't want to wait more than a few
> >>>>> seconds for each query to finish. My attempt to do this looks like
> >>>>> the following:
> >>>>>
> >>>>>>
> >>>>>> try {
> >>>>>>      RDFConnection rdfConnection =
> >> RDFConnectionFactory.connect(dataset);
> >>>>>>      QueryExecution queryExecution = rdfConnection.query(query);
> >>>>>>      queryExecution.setTimeout(timeoutMilliseconds);
> >>>>>>      ResultSet resultSet = queryExecution.execSelect();
> >>>>>>      while (resultSet.hasNext()) {
> >>>>>>        QuerySolution querySolution = resultSet.next();
> >>>>>>        ...
> >>>>>>      }
> >>>>>> } catch (QueryCancelledException e) {
> >>>>>>       ...
> >>>>>> }
> >>>>>
> >>>>>
> >>>>> The problem is that this is not working. With htop I see that
> >>>>> the process gets stuck in disk operations. One of the queries
> >>>>> took about 2 hours with the code above. An idea would
> >>>>> be trying to run this in a new thread and stopping the thread outside
> >>>> once
> >>>>> the timeout is reached, but I'm almost sure this wouldn't be a safe
> >>>>> way to stop the processing, even if it worked.
> >>>>>
> >>>>> Is there a better way to do this?
> >>>>>
> >>>>> Cristobal
> >>>>>
> >>>>
> >>>
> >>
> >
>

Re: How to set a timeout for QueryExecution correctly?

Posted by Andy Seaborne <an...@apache.org>.

On 30/07/2021 18:04, Cristóbal Miranda wrote:
>> For Q3, does "LIMIT 1" make any difference?
> 
> 
> No, I'm trying it now and it is taking several minutes at least.
> 
> 
>> except different cardinality. Because ?var5 isn't used eslewhere, is
>> that waht was meant here?
> 
> 
> I'm also not too sure about the meaning of the queries in general, I'm just
> using
> a large dataset of anonymized queries. I took one of the
> files from here https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
> (they say those are from Wikidata's users) and executed a script to filter
> out some types of queries, the result of that is a file with 27 million
> queries.
> But to answer your question about this specific query, since this is
> Wikidata,
> what I'd do is to run the query on https://query.wikidata.org/, and infer
> the meaning
> of the variables, ?var1 is some person and P570 date of death, etc... and
> from the filters it seems
> that they want to find those who died in the last 30 days and were born
> after 1954. ?var5
> usage seems to indicate that they want to find those whose cause of death
> is not
> known (because it's an OPTIONAL and !BOUND would filter out those where
> there is a cause of death).
> In this case it is reasonable that I get no results because the dataset I'm
> using is older than 30 days.
> So yes, you are right, they are equivalent for this query and with NOT
> EXISTS would
> probably be faster. But still, what I'd like to have is a way to stop the
> processing
> after a few seconds consistently, independent of the query.
> 
> I just did quick peek at the code and found that, correct me if I'm wrong,
> the cancellation of
> a query is propagated in a top down manner for the QueryIterators, in a way
> that to notice
> that the query was aborted you have to get the next result (I'm looking
> specifically at QueryIterRepeatApply.java:65),
> meanwhile the query might be stuck at some heavy processing in a lower
> level, which might be the case for
> nested OPTIONALs.

Yes - that the mechanism but nested OPTIONALs are optimized to 
"(condition)" which incrementally reads iterators.

In a query plan, ("qparse --print=opt") it shows (condition) not (leftjoin).

The (minus) operator may be reading all one side and then missing the 
timeout but it is still advancing one side (the left)

So trying the FILTER NOT EXISTS version of one of the problem queries 
will give us some help in identifying if the problem is MINUS or not.

> In the case of QueryIterMinus I see that in the constructor you have to
> create the index for the right iterator, but
> that index creation is not stoppable. (On a side note, did you try doing
> QueryIterMinus similar to QueryIterOptionalIndex,
> using QueryIterRepeatApply? I just have the impression that running that
> subquery with left values bound and checking
> for only one result could be faster, but this is just a suggestion in case
> you haven't tried, don't take this
> too seriously).

The right-hand-side index building is once per operator so gets reuses 
across the whole of traversing the left.

IndexFactory.create is reading a query iterator.

So on the surface, query iterators are being read and the time should be 
seen.

Unless it hits the LinearIndex case which looks expensive.

> 
> If you consider that those are things to be fixed and don't have it in high
> priority I could give it a try and send a PR,
> if it's ok with you.

That would be excellent but let's establish the problem first to know it 
is going to improve things.

If you can point VisualVM/YourKit/... at a slow query and identify the 
code hot spots, that would be perfect.

A small example dataset would also be good so we have a minimal, 
executable example.

> 
> Sorry if I seem too intrusive here, but I'm really interested in having the
> timeout work well for any query!

Not at all - it is part of the point of open source.

     Andy

> 
> 
> 
> On Fri, 30 Jul 2021 at 04:47, Andy Seaborne <an...@apache.org> wrote:
> 
>> Hi Cristóbal,
>>
>> I can't tell from just the queries.
>>
>> You say the first query has zero results so it is not the ORDER BY.
>>
>> The queries themselves may be inefficient and changing the order may
>> have a significant effect (all data dependent).
>>
>> All three do not use FILTER NOT EXISTS where they could. That may well
>> be faster.
>>
>> Q1 uses OPTIONAL !BOUND (see inline), Q2 and Q3 use MINUS.
>>
>> For Q3, does "LIMIT 1" make any difference?
>>
>>       Andy
>>
>> On 29/07/2021 15:57, Cristóbal Miranda wrote:
>>> Hi Andy,
>>>
>>> I have three queries in which this happens:
>>>
>>> 1.
>>>
>>> SELECT ?var1  ?var2  ?var3  ?var4
>>>> WHERE {
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var2 .
>>>>    FILTER (  ( (  ?var2  >  "2016-07-30T00:00:00Z"^^<
>>>> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
>>>> ) .
>>>>    FILTER (  ( (  ?var2  >  (  NOW (  ) - "P32D"^^<
>>>> http://www.w3.org/2001/XMLSchema#duration>  )  ) && (  ?var2  <  NOW
>> (  )
>>>>    ) )
>>>> ) .
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P31>  <
>>>> http://www.wikidata.org/entity/Q5> .
>>
>> ---------
>>>>    OPTIONAL {
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P1196>  ?var5 .
>>>>    }
>>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
>>
>> same as
>>
>> FILTER
>> NOT EXISTS { ?var1 <http://www.wikidata.org/prop/direct/P1196> ?var5 }
>>
>> except different cardinality. Because ?var5 isn't used eslewhere, is
>> that waht was meant here?
>>
>>>> ) .
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var6 .
>>>>    FILTER (  ( (  ?var6  >  "1954-12-31T00:00:00Z"^^<
>>>> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
>>>> ) .
>>>>    OPTIONAL {
>>>>     ?var1  <http://wikiba.se/ontology#statements>  ?var3 .
>>>>    }
>>>>    OPTIONAL {
>>>>     ?var1  <http://wikiba.se/ontology#sitelinks>  ?var4 .
>>>>    }
>>>> }
>>>> ORDER BY  DESC( ?var4 ) DESC( ?var2 )ASC( ?var1 )
>>>>
>>>>
>>> 2.
>>>
>>>> SELECT ?var1  ?var2
>>>> WHERE {
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
>>>>     MINUS   {
>>>>       ?var1  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <
>>>> http://wikiba.se/ontology#Property> .
>>>>     }
>>>> }
>>>>
>>>>
>>> 3.
>>>
>>>> SELECT DISTINCT ?var1  ?var2  ?var3  ?var4
>>>> WHERE {
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
>>>>    FILTER (  ( (  ?var3  >  "1956-01-01T00:00:00Z"^^<
>>>> http://www.w3.org/2001/XMLSchema#dateTime>  ) && (  ?var3  <
>>>>    "1957-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
>> ) )
>>>> ) .
>>>>    OPTIONAL {
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var5 .
>>>>    }
>>>>    FILTER (  ( !( BOUND (  ?var5  ) ) )
>>>> ) .
>>>>     ?var2  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
>>>>    OPTIONAL {
>>>>     ?var2  <http://www.wikidata.org/prop/direct/P570>  ?var6 .
>>>>    }
>>>>    FILTER (  ( !( BOUND (  ?var6  ) ) )
>>>> ) .
>>>>     ?var1  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
>>>>     ?var2  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P31>  <
>>>> http://www.wikidata.org/entity/Q5> .
>>>>     ?var2  <http://www.wikidata.org/prop/direct/P31>  <
>>>> http://www.wikidata.org/entity/Q5> .
>>>>     ?var1  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
>>>>     ?var2  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
>>>>    FILTER (  ( (  STR (  ?var1  )  <  STR (  ?var2  )  ) )
>>>> ) .
>>>>     MINUS   {
>>>>       ?var1  <http://www.wikidata.org/prop/direct/P7>  ?var2 .
>>>>     }
>>>>     MINUS   {
>>>>       ?var1  <http://www.wikidata.org/prop/direct/P9>  ?var2 .
>>>>     }
>>>>     MINUS   {
>>>>       ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
>>>>     }
>>>>     MINUS   {
>>>>       ?var1  <http://www.wikidata.org/prop/direct/P460>  ?var2 .
>>>>     }
>>>> }
>>>> LIMIT 500
>>>
>>>
>>>
>>> The first one is taking 141 minutes,  the second 32 minutes and the third
>>> is still running.
>>> I have run about 1200 queries, where 38 times the exception was thrown,
>> but
>>> interestingly not
>>> in the first two queries, where I got 0 and 575925 results respectively.
>>>
>>> I'm using jena 4.1.0.
>>>
>>>
>>>
>>> On Thu, 29 Jul 2021 at 05:27, Andy Seaborne <an...@apache.org> wrote:
>>>
>>>> Hi Cristóbal,
>>>>
>>>> What's the query and which version of jena is this?
>>>>
>>>>        Andy
>>>>
>>>> On 28/07/2021 19:39, Cristóbal Miranda wrote:
>>>>> Hello everyone,
>>>>>
>>>>> I'm trying to run a sequence of queries with TDB, using a
>>>>> locally loaded dataset. I don't want to wait more than a few
>>>>> seconds for each query to finish. My attempt to do this looks like
>>>>> the following:
>>>>>
>>>>>>
>>>>>> try {
>>>>>>      RDFConnection rdfConnection =
>> RDFConnectionFactory.connect(dataset);
>>>>>>      QueryExecution queryExecution = rdfConnection.query(query);
>>>>>>      queryExecution.setTimeout(timeoutMilliseconds);
>>>>>>      ResultSet resultSet = queryExecution.execSelect();
>>>>>>      while (resultSet.hasNext()) {
>>>>>>        QuerySolution querySolution = resultSet.next();
>>>>>>        ...
>>>>>>      }
>>>>>> } catch (QueryCancelledException e) {
>>>>>>       ...
>>>>>> }
>>>>>
>>>>>
>>>>> The problem is that this is not working. With htop I see that
>>>>> the process gets stuck in disk operations. One of the queries
>>>>> took about 2 hours with the code above. An idea would
>>>>> be trying to run this in a new thread and stopping the thread outside
>>>> once
>>>>> the timeout is reached, but I'm almost sure this wouldn't be a safe
>>>>> way to stop the processing, even if it worked.
>>>>>
>>>>> Is there a better way to do this?
>>>>>
>>>>> Cristobal
>>>>>
>>>>
>>>
>>
> 

Re: How to set a timeout for QueryExecution correctly?

Posted by Cristóbal Miranda <cr...@gmail.com>.
> For Q3, does "LIMIT 1" make any difference?


No, I'm trying it now and it is taking several minutes at least.


> except different cardinality. Because ?var5 isn't used eslewhere, is
> that waht was meant here?


I'm also not too sure about the meaning of the queries in general, I'm just
using
a large dataset of anonymized queries. I took one of the
files from here https://iccl.inf.tu-dresden.de/web/Wikidata_SPARQL_Logs/en
(they say those are from Wikidata's users) and executed a script to filter
out some types of queries, the result of that is a file with 27 million
queries.
But to answer your question about this specific query, since this is
Wikidata,
what I'd do is to run the query on https://query.wikidata.org/, and infer
the meaning
of the variables, ?var1 is some person and P570 date of death, etc... and
from the filters it seems
that they want to find those who died in the last 30 days and were born
after 1954. ?var5
usage seems to indicate that they want to find those whose cause of death
is not
known (because it's an OPTIONAL and !BOUND would filter out those where
there is a cause of death).
In this case it is reasonable that I get no results because the dataset I'm
using is older than 30 days.
So yes, you are right, they are equivalent for this query and with NOT
EXISTS would
probably be faster. But still, what I'd like to have is a way to stop the
processing
after a few seconds consistently, independent of the query.

I just did quick peek at the code and found that, correct me if I'm wrong,
the cancellation of
a query is propagated in a top down manner for the QueryIterators, in a way
that to notice
that the query was aborted you have to get the next result (I'm looking
specifically at QueryIterRepeatApply.java:65),
meanwhile the query might be stuck at some heavy processing in a lower
level, which might be the case for
nested OPTIONALs.
In the case of QueryIterMinus I see that in the constructor you have to
create the index for the right iterator, but
that index creation is not stoppable. (On a side note, did you try doing
QueryIterMinus similar to QueryIterOptionalIndex,
using QueryIterRepeatApply? I just have the impression that running that
subquery with left values bound and checking
for only one result could be faster, but this is just a suggestion in case
you haven't tried, don't take this
too seriously).

If you consider that those are things to be fixed and don't have it in high
priority I could give it a try and send a PR,
if it's ok with you.

Sorry if I seem too intrusive here, but I'm really interested in having the
timeout work well for any query!



On Fri, 30 Jul 2021 at 04:47, Andy Seaborne <an...@apache.org> wrote:

> Hi Cristóbal,
>
> I can't tell from just the queries.
>
> You say the first query has zero results so it is not the ORDER BY.
>
> The queries themselves may be inefficient and changing the order may
> have a significant effect (all data dependent).
>
> All three do not use FILTER NOT EXISTS where they could. That may well
> be faster.
>
> Q1 uses OPTIONAL !BOUND (see inline), Q2 and Q3 use MINUS.
>
> For Q3, does "LIMIT 1" make any difference?
>
>      Andy
>
> On 29/07/2021 15:57, Cristóbal Miranda wrote:
> > Hi Andy,
> >
> > I have three queries in which this happens:
> >
> > 1.
> >
> > SELECT ?var1  ?var2  ?var3  ?var4
> >> WHERE {
> >>    ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var2 .
> >>   FILTER (  ( (  ?var2  >  "2016-07-30T00:00:00Z"^^<
> >> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
> >> ) .
> >>   FILTER (  ( (  ?var2  >  (  NOW (  ) - "P32D"^^<
> >> http://www.w3.org/2001/XMLSchema#duration>  )  ) && (  ?var2  <  NOW
> (  )
> >>   ) )
> >> ) .
> >>    ?var1  <http://www.wikidata.org/prop/direct/P31>  <
> >> http://www.wikidata.org/entity/Q5> .
>
> ---------
> >>   OPTIONAL {
> >>    ?var1  <http://www.wikidata.org/prop/direct/P1196>  ?var5 .
> >>   }
> >>   FILTER (  ( !( BOUND (  ?var5  ) ) )
>
> same as
>
> FILTER
> NOT EXISTS { ?var1 <http://www.wikidata.org/prop/direct/P1196> ?var5 }
>
> except different cardinality. Because ?var5 isn't used eslewhere, is
> that waht was meant here?
>
> >> ) .
> >>    ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var6 .
> >>   FILTER (  ( (  ?var6  >  "1954-12-31T00:00:00Z"^^<
> >> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
> >> ) .
> >>   OPTIONAL {
> >>    ?var1  <http://wikiba.se/ontology#statements>  ?var3 .
> >>   }
> >>   OPTIONAL {
> >>    ?var1  <http://wikiba.se/ontology#sitelinks>  ?var4 .
> >>   }
> >> }
> >> ORDER BY  DESC( ?var4 ) DESC( ?var2 )ASC( ?var1 )
> >>
> >>
> > 2.
> >
> >> SELECT ?var1  ?var2
> >> WHERE {
> >>    ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
> >>    MINUS   {
> >>      ?var1  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <
> >> http://wikiba.se/ontology#Property> .
> >>    }
> >> }
> >>
> >>
> > 3.
> >
> >> SELECT DISTINCT ?var1  ?var2  ?var3  ?var4
> >> WHERE {
> >>    ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
> >>   FILTER (  ( (  ?var3  >  "1956-01-01T00:00:00Z"^^<
> >> http://www.w3.org/2001/XMLSchema#dateTime>  ) && (  ?var3  <
> >>   "1957-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>
> ) )
> >> ) .
> >>   OPTIONAL {
> >>    ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var5 .
> >>   }
> >>   FILTER (  ( !( BOUND (  ?var5  ) ) )
> >> ) .
> >>    ?var2  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
> >>   OPTIONAL {
> >>    ?var2  <http://www.wikidata.org/prop/direct/P570>  ?var6 .
> >>   }
> >>   FILTER (  ( !( BOUND (  ?var6  ) ) )
> >> ) .
> >>    ?var1  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
> >>    ?var2  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
> >>    ?var1  <http://www.wikidata.org/prop/direct/P31>  <
> >> http://www.wikidata.org/entity/Q5> .
> >>    ?var2  <http://www.wikidata.org/prop/direct/P31>  <
> >> http://www.wikidata.org/entity/Q5> .
> >>    ?var1  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
> >>    ?var2  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
> >>   FILTER (  ( (  STR (  ?var1  )  <  STR (  ?var2  )  ) )
> >> ) .
> >>    MINUS   {
> >>      ?var1  <http://www.wikidata.org/prop/direct/P7>  ?var2 .
> >>    }
> >>    MINUS   {
> >>      ?var1  <http://www.wikidata.org/prop/direct/P9>  ?var2 .
> >>    }
> >>    MINUS   {
> >>      ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
> >>    }
> >>    MINUS   {
> >>      ?var1  <http://www.wikidata.org/prop/direct/P460>  ?var2 .
> >>    }
> >> }
> >> LIMIT 500
> >
> >
> >
> > The first one is taking 141 minutes,  the second 32 minutes and the third
> > is still running.
> > I have run about 1200 queries, where 38 times the exception was thrown,
> but
> > interestingly not
> > in the first two queries, where I got 0 and 575925 results respectively.
> >
> > I'm using jena 4.1.0.
> >
> >
> >
> > On Thu, 29 Jul 2021 at 05:27, Andy Seaborne <an...@apache.org> wrote:
> >
> >> Hi Cristóbal,
> >>
> >> What's the query and which version of jena is this?
> >>
> >>       Andy
> >>
> >> On 28/07/2021 19:39, Cristóbal Miranda wrote:
> >>> Hello everyone,
> >>>
> >>> I'm trying to run a sequence of queries with TDB, using a
> >>> locally loaded dataset. I don't want to wait more than a few
> >>> seconds for each query to finish. My attempt to do this looks like
> >>> the following:
> >>>
> >>>>
> >>>> try {
> >>>>     RDFConnection rdfConnection =
> RDFConnectionFactory.connect(dataset);
> >>>>     QueryExecution queryExecution = rdfConnection.query(query);
> >>>>     queryExecution.setTimeout(timeoutMilliseconds);
> >>>>     ResultSet resultSet = queryExecution.execSelect();
> >>>>     while (resultSet.hasNext()) {
> >>>>       QuerySolution querySolution = resultSet.next();
> >>>>       ...
> >>>>     }
> >>>> } catch (QueryCancelledException e) {
> >>>>      ...
> >>>> }
> >>>
> >>>
> >>> The problem is that this is not working. With htop I see that
> >>> the process gets stuck in disk operations. One of the queries
> >>> took about 2 hours with the code above. An idea would
> >>> be trying to run this in a new thread and stopping the thread outside
> >> once
> >>> the timeout is reached, but I'm almost sure this wouldn't be a safe
> >>> way to stop the processing, even if it worked.
> >>>
> >>> Is there a better way to do this?
> >>>
> >>> Cristobal
> >>>
> >>
> >
>

Re: How to set a timeout for QueryExecution correctly?

Posted by Andy Seaborne <an...@apache.org>.
Hi Cristóbal,

I can't tell from just the queries.

You say the first query has zero results so it is not the ORDER BY.

The queries themselves may be inefficient and changing the order may 
have a significant effect (all data dependent).

All three do not use FILTER NOT EXISTS where they could. That may well 
be faster.

Q1 uses OPTIONAL !BOUND (see inline), Q2 and Q3 use MINUS.

For Q3, does "LIMIT 1" make any difference?

     Andy

On 29/07/2021 15:57, Cristóbal Miranda wrote:
> Hi Andy,
> 
> I have three queries in which this happens:
> 
> 1.
> 
> SELECT ?var1  ?var2  ?var3  ?var4
>> WHERE {
>>    ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var2 .
>>   FILTER (  ( (  ?var2  >  "2016-07-30T00:00:00Z"^^<
>> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
>> ) .
>>   FILTER (  ( (  ?var2  >  (  NOW (  ) - "P32D"^^<
>> http://www.w3.org/2001/XMLSchema#duration>  )  ) && (  ?var2  <  NOW (  )
>>   ) )
>> ) .
>>    ?var1  <http://www.wikidata.org/prop/direct/P31>  <
>> http://www.wikidata.org/entity/Q5> .

---------
>>   OPTIONAL {
>>    ?var1  <http://www.wikidata.org/prop/direct/P1196>  ?var5 .
>>   }
>>   FILTER (  ( !( BOUND (  ?var5  ) ) )

same as

FILTER
NOT EXISTS { ?var1 <http://www.wikidata.org/prop/direct/P1196> ?var5 }

except different cardinality. Because ?var5 isn't used eslewhere, is 
that waht was meant here?

>> ) .
>>    ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var6 .
>>   FILTER (  ( (  ?var6  >  "1954-12-31T00:00:00Z"^^<
>> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
>> ) .
>>   OPTIONAL {
>>    ?var1  <http://wikiba.se/ontology#statements>  ?var3 .
>>   }
>>   OPTIONAL {
>>    ?var1  <http://wikiba.se/ontology#sitelinks>  ?var4 .
>>   }
>> }
>> ORDER BY  DESC( ?var4 ) DESC( ?var2 )ASC( ?var1 )
>>
>>
> 2.
> 
>> SELECT ?var1  ?var2
>> WHERE {
>>    ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
>>    MINUS   {
>>      ?var1  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <
>> http://wikiba.se/ontology#Property> .
>>    }
>> }
>>
>>
> 3.
> 
>> SELECT DISTINCT ?var1  ?var2  ?var3  ?var4
>> WHERE {
>>    ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
>>   FILTER (  ( (  ?var3  >  "1956-01-01T00:00:00Z"^^<
>> http://www.w3.org/2001/XMLSchema#dateTime>  ) && (  ?var3  <
>>   "1957-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>  ) )
>> ) .
>>   OPTIONAL {
>>    ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var5 .
>>   }
>>   FILTER (  ( !( BOUND (  ?var5  ) ) )
>> ) .
>>    ?var2  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
>>   OPTIONAL {
>>    ?var2  <http://www.wikidata.org/prop/direct/P570>  ?var6 .
>>   }
>>   FILTER (  ( !( BOUND (  ?var6  ) ) )
>> ) .
>>    ?var1  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
>>    ?var2  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
>>    ?var1  <http://www.wikidata.org/prop/direct/P31>  <
>> http://www.wikidata.org/entity/Q5> .
>>    ?var2  <http://www.wikidata.org/prop/direct/P31>  <
>> http://www.wikidata.org/entity/Q5> .
>>    ?var1  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
>>    ?var2  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
>>   FILTER (  ( (  STR (  ?var1  )  <  STR (  ?var2  )  ) )
>> ) .
>>    MINUS   {
>>      ?var1  <http://www.wikidata.org/prop/direct/P7>  ?var2 .
>>    }
>>    MINUS   {
>>      ?var1  <http://www.wikidata.org/prop/direct/P9>  ?var2 .
>>    }
>>    MINUS   {
>>      ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
>>    }
>>    MINUS   {
>>      ?var1  <http://www.wikidata.org/prop/direct/P460>  ?var2 .
>>    }
>> }
>> LIMIT 500
> 
> 
> 
> The first one is taking 141 minutes,  the second 32 minutes and the third
> is still running.
> I have run about 1200 queries, where 38 times the exception was thrown, but
> interestingly not
> in the first two queries, where I got 0 and 575925 results respectively.
> 
> I'm using jena 4.1.0.
> 
> 
> 
> On Thu, 29 Jul 2021 at 05:27, Andy Seaborne <an...@apache.org> wrote:
> 
>> Hi Cristóbal,
>>
>> What's the query and which version of jena is this?
>>
>>       Andy
>>
>> On 28/07/2021 19:39, Cristóbal Miranda wrote:
>>> Hello everyone,
>>>
>>> I'm trying to run a sequence of queries with TDB, using a
>>> locally loaded dataset. I don't want to wait more than a few
>>> seconds for each query to finish. My attempt to do this looks like
>>> the following:
>>>
>>>>
>>>> try {
>>>>     RDFConnection rdfConnection = RDFConnectionFactory.connect(dataset);
>>>>     QueryExecution queryExecution = rdfConnection.query(query);
>>>>     queryExecution.setTimeout(timeoutMilliseconds);
>>>>     ResultSet resultSet = queryExecution.execSelect();
>>>>     while (resultSet.hasNext()) {
>>>>       QuerySolution querySolution = resultSet.next();
>>>>       ...
>>>>     }
>>>> } catch (QueryCancelledException e) {
>>>>      ...
>>>> }
>>>
>>>
>>> The problem is that this is not working. With htop I see that
>>> the process gets stuck in disk operations. One of the queries
>>> took about 2 hours with the code above. An idea would
>>> be trying to run this in a new thread and stopping the thread outside
>> once
>>> the timeout is reached, but I'm almost sure this wouldn't be a safe
>>> way to stop the processing, even if it worked.
>>>
>>> Is there a better way to do this?
>>>
>>> Cristobal
>>>
>>
> 

Re: How to set a timeout for QueryExecution correctly?

Posted by Cristóbal Miranda <cr...@gmail.com>.
Hi Andy,

I have three queries in which this happens:

1.

SELECT ?var1  ?var2  ?var3  ?var4
> WHERE {
>   ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var2 .
>  FILTER (  ( (  ?var2  >  "2016-07-30T00:00:00Z"^^<
> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
> ) .
>  FILTER (  ( (  ?var2  >  (  NOW (  ) - "P32D"^^<
> http://www.w3.org/2001/XMLSchema#duration>  )  ) && (  ?var2  <  NOW (  )
>  ) )
> ) .
>   ?var1  <http://www.wikidata.org/prop/direct/P31>  <
> http://www.wikidata.org/entity/Q5> .
>  OPTIONAL {
>   ?var1  <http://www.wikidata.org/prop/direct/P1196>  ?var5 .
>  }
>  FILTER (  ( !( BOUND (  ?var5  ) ) )
> ) .
>   ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var6 .
>  FILTER (  ( (  ?var6  >  "1954-12-31T00:00:00Z"^^<
> http://www.w3.org/2001/XMLSchema#dateTime>  ) )
> ) .
>  OPTIONAL {
>   ?var1  <http://wikiba.se/ontology#statements>  ?var3 .
>  }
>  OPTIONAL {
>   ?var1  <http://wikiba.se/ontology#sitelinks>  ?var4 .
>  }
> }
> ORDER BY  DESC( ?var4 ) DESC( ?var2 )ASC( ?var1 )
>
>
2.

> SELECT ?var1  ?var2
> WHERE {
>   ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
>   MINUS   {
>     ?var1  <http://www.w3.org/1999/02/22-rdf-syntax-ns#type>  <
> http://wikiba.se/ontology#Property> .
>   }
> }
>
>
3.

> SELECT DISTINCT ?var1  ?var2  ?var3  ?var4
> WHERE {
>   ?var1  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
>  FILTER (  ( (  ?var3  >  "1956-01-01T00:00:00Z"^^<
> http://www.w3.org/2001/XMLSchema#dateTime>  ) && (  ?var3  <
>  "1957-01-01T00:00:00Z"^^<http://www.w3.org/2001/XMLSchema#dateTime>  ) )
> ) .
>  OPTIONAL {
>   ?var1  <http://www.wikidata.org/prop/direct/P570>  ?var5 .
>  }
>  FILTER (  ( !( BOUND (  ?var5  ) ) )
> ) .
>   ?var2  <http://www.wikidata.org/prop/direct/P569>  ?var3 .
>  OPTIONAL {
>   ?var2  <http://www.wikidata.org/prop/direct/P570>  ?var6 .
>  }
>  FILTER (  ( !( BOUND (  ?var6  ) ) )
> ) .
>   ?var1  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
>   ?var2  <http://www.w3.org/2000/01/rdf-schema#label>  ?var4 .
>   ?var1  <http://www.wikidata.org/prop/direct/P31>  <
> http://www.wikidata.org/entity/Q5> .
>   ?var2  <http://www.wikidata.org/prop/direct/P31>  <
> http://www.wikidata.org/entity/Q5> .
>   ?var1  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
>   ?var2  <http://www.wikidata.org/prop/direct/P21>  ?var7 .
>  FILTER (  ( (  STR (  ?var1  )  <  STR (  ?var2  )  ) )
> ) .
>   MINUS   {
>     ?var1  <http://www.wikidata.org/prop/direct/P7>  ?var2 .
>   }
>   MINUS   {
>     ?var1  <http://www.wikidata.org/prop/direct/P9>  ?var2 .
>   }
>   MINUS   {
>     ?var1  <http://www.wikidata.org/prop/direct/P1889>  ?var2 .
>   }
>   MINUS   {
>     ?var1  <http://www.wikidata.org/prop/direct/P460>  ?var2 .
>   }
> }
> LIMIT 500



The first one is taking 141 minutes,  the second 32 minutes and the third
is still running.
I have run about 1200 queries, where 38 times the exception was thrown, but
interestingly not
in the first two queries, where I got 0 and 575925 results respectively.

I'm using jena 4.1.0.



On Thu, 29 Jul 2021 at 05:27, Andy Seaborne <an...@apache.org> wrote:

> Hi Cristóbal,
>
> What's the query and which version of jena is this?
>
>      Andy
>
> On 28/07/2021 19:39, Cristóbal Miranda wrote:
> > Hello everyone,
> >
> > I'm trying to run a sequence of queries with TDB, using a
> > locally loaded dataset. I don't want to wait more than a few
> > seconds for each query to finish. My attempt to do this looks like
> > the following:
> >
> >>
> >> try {
> >>    RDFConnection rdfConnection = RDFConnectionFactory.connect(dataset);
> >>    QueryExecution queryExecution = rdfConnection.query(query);
> >>    queryExecution.setTimeout(timeoutMilliseconds);
> >>    ResultSet resultSet = queryExecution.execSelect();
> >>    while (resultSet.hasNext()) {
> >>      QuerySolution querySolution = resultSet.next();
> >>      ...
> >>    }
> >> } catch (QueryCancelledException e) {
> >>     ...
> >> }
> >
> >
> > The problem is that this is not working. With htop I see that
> > the process gets stuck in disk operations. One of the queries
> > took about 2 hours with the code above. An idea would
> > be trying to run this in a new thread and stopping the thread outside
> once
> > the timeout is reached, but I'm almost sure this wouldn't be a safe
> > way to stop the processing, even if it worked.
> >
> > Is there a better way to do this?
> >
> > Cristobal
> >
>

Re: How to set a timeout for QueryExecution correctly?

Posted by Andy Seaborne <an...@apache.org>.
Hi Cristóbal,

What's the query and which version of jena is this?

     Andy

On 28/07/2021 19:39, Cristóbal Miranda wrote:
> Hello everyone,
> 
> I'm trying to run a sequence of queries with TDB, using a
> locally loaded dataset. I don't want to wait more than a few
> seconds for each query to finish. My attempt to do this looks like
> the following:
> 
>>
>> try {
>>    RDFConnection rdfConnection = RDFConnectionFactory.connect(dataset);
>>    QueryExecution queryExecution = rdfConnection.query(query);
>>    queryExecution.setTimeout(timeoutMilliseconds);
>>    ResultSet resultSet = queryExecution.execSelect();
>>    while (resultSet.hasNext()) {
>>      QuerySolution querySolution = resultSet.next();
>>      ...
>>    }
>> } catch (QueryCancelledException e) {
>>     ...
>> }
> 
> 
> The problem is that this is not working. With htop I see that
> the process gets stuck in disk operations. One of the queries
> took about 2 hours with the code above. An idea would
> be trying to run this in a new thread and stopping the thread outside once
> the timeout is reached, but I'm almost sure this wouldn't be a safe
> way to stop the processing, even if it worked.
> 
> Is there a better way to do this?
> 
> Cristobal
>