Posted to users@jena.apache.org by Élie Roux <el...@telecom-bretagne.eu> on 2020/08/27 08:10:00 UTC

shortcut for querying dates fast?

Dear all,

I have a dataset with (among other things) about 400,000 triples in the form

?a adm:logDate ?d

where ?d is an xsd:dateTime. I'm writing a query to get all the
triples that have a ?d in a certain interval. There are usually very
few of them (around say 200). I'm writing a query that looks like

construct {
    ?va  adm:hasactivityon ?d .
} where {
    ?le adm:logDate ?d .
     FILTER(?d > "2020-08-01T00:00:00"^^xsd:dateTime)
    ?va adm:logEntry ?le .
}

But it's too slow for our purpose (3.5s). I suspect it's conceptually
simple to have a very performant implementation (using an index
dedicated to xsd:dateTime literals that could be queried), but I also
suspect SPARQL doesn't make that kind of performant algorithm easy to
summon in such a query (which is a mix of a bgp and a filter instead
of a direct call to a performant index).

So a few questions:
- are there other ways to write this query to make it more performant?
- my impression is that what I want with time is similar to what
GeoSPARQL provides for space... is there something similar to
GeoSPARQL for time?
- would that kind of performant index require the same type of
mechanism as the jena:text extension?
- is it worth reporting this on the SPARQL 1.2 github repo?

Thanks in advance,
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Andy Seaborne <an...@apache.org>.


On 04/09/2020 14:38, Élie Roux wrote:
>> select ?le ?la where {
>>       ?le adm:logDate ?sdate .
>>       FILTER(?sdate > "2021-08-20T00:00:00"^^xsd:dateTime)
>>       BIND(?le AS ?la)
>> }
> 
> In that trivial case yes, the longer query doesn't allow that:

Put the common part in a single place:


WHERE {
    ... common part ...
    {
      ...
    } UNION {
      ...
    }
}

and reorder to put the common part together.
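
A rough illustration of why factoring out the common part can help, written as a toy Python simulation rather than anything ARQ does internally. The data, the `stats` counter and the function names are all made up for the example, and whether a real engine evaluates the shared pattern only once depends on its optimizer.

```python
# Toy data: log entry -> (date, type). All names and values are hypothetical.
LOG_ENTRIES = {
    "le1": ("2020-08-25T00:00:00", "Synced"),
    "le2": ("2020-07-01T00:00:00", "Synced"),
    "le3": ("2020-09-01T00:00:00", "InitialDataCreation"),
    "le4": ("2020-09-02T00:00:00", "Other"),
}

def entries_after(cutoff, stats):
    """Common part: scan all log dates and keep those after the cutoff."""
    stats["scans"] += 1
    # Same-format ISO timestamps compare correctly as plain strings.
    return [le for le, (date, _) in LOG_ENTRIES.items() if date > cutoff]

def entry_type(le):
    """Return the type recorded for a log entry."""
    return LOG_ENTRIES[le][1]

def repeated_in_each_branch(cutoff):
    """{ common . A } UNION { common . B }: the common part runs twice."""
    stats = {"scans": 0}
    left = [le for le in entries_after(cutoff, stats)
            if entry_type(le) == "Synced"]
    right = [le for le in entries_after(cutoff, stats)
             if entry_type(le) == "InitialDataCreation"]
    return sorted(left + right), stats["scans"]

def common_part_first(cutoff):
    """common . { A } UNION { B }: the common part runs once."""
    stats = {"scans": 0}
    recent = entries_after(cutoff, stats)
    left = [le for le in recent if entry_type(le) == "Synced"]
    right = [le for le in recent if entry_type(le) == "InitialDataCreation"]
    return sorted(left + right), stats["scans"]
```

Both functions return the same entries for a given cutoff; only the number of passes over the date data differs, which is the part that dominates in the timings reported in this thread.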

> 
> construct {
>    ?eadm tmp:lastSync ?d .
>    ?eadm tmp:dateCreated ?sdate  .
>   }
> WHERE
>    {
>      {
>          ?le adm:logDate ?sdate .
>          FILTER(?sdate > "2020-08-20T00:00:00"^^xsd:dateTime)
>          ?le a adm:Synced .
>          ?va adm:logEntry ?le ;
>              adm:adminAbout ?v .
>          ?v bdo:volumeOf ?iinstance .
>          ?eadm adm:adminAbout ?iinstance .
>      } union {
>          ?le adm:logDate ?sdate .
>          FILTER(?sdate > "2020-08-20T00:00:00"^^xsd:dateTime)
>          ?le a adm:InitialDataCreation .
>          ?eadm adm:logEntry ?le ;
>              adm:adminAbout ?e .
>          ?e a ?type .
>          FILTER (?type != bdo:ImageGroup)
>      }
>    }
> 
> and takes 3.5s.
> 
> Best,
>

Re: shortcut for querying dates fast?

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> select ?le ?la where {
>      ?le adm:logDate ?sdate .
>      FILTER(?sdate > "2021-08-20T00:00:00"^^xsd:dateTime)
>      BIND(?le AS ?la)
> }

In that trivial case yes, the longer query doesn't allow that:

construct {
  ?eadm tmp:lastSync ?d .
  ?eadm tmp:dateCreated ?sdate  .
 }
WHERE
  {
    {
        ?le adm:logDate ?sdate .
        FILTER(?sdate > "2020-08-20T00:00:00"^^xsd:dateTime)
        ?le a adm:Synced .
        ?va adm:logEntry ?le ;
            adm:adminAbout ?v .
        ?v bdo:volumeOf ?iinstance .
        ?eadm adm:adminAbout ?iinstance .
    } union {
        ?le adm:logDate ?sdate .
        FILTER(?sdate > "2020-08-20T00:00:00"^^xsd:dateTime)
        ?le a adm:InitialDataCreation .
        ?eadm adm:logEntry ?le ;
            adm:adminAbout ?e .
        ?e a ?type .
        FILTER (?type != bdo:ImageGroup)
    }
  }

and takes 3.5s.

Best,
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Andy Seaborne <an...@apache.org>.


On 02/09/2020 16:17, Élie Roux wrote:
> P.S.: Here's another aspect of the problem (although in a different
> aspect of the code). If I make the same query + filter on dates in
> unions, there seems to be no cache mechanism, as the following takes
> 3.5s instead of 1.7s:
> 
> select ?le ?la where {
>      {
>          ?le adm:logDate ?sdate .
>          FILTER(?sdate > "2021-08-20T00:00:00"^^xsd:dateTime)
>      } union {
>          ?la adm:logDate ?sdate .
>          FILTER(?sdate > "2021-08-20T00:00:00"^^xsd:dateTime)
>      }
> }

select ?le ?la where {
     ?le adm:logDate ?sdate .
     FILTER(?sdate > "2021-08-20T00:00:00"^^xsd:dateTime)
     BIND(?le AS ?la)
}


> 
> perhaps that could be the subject of another issue? Or is it another
> case that is too abnormal to be optimized? (In a real-life use case, I
> would put two different BGP after each FILTER of course)
> 
> Best,
>

Re: shortcut for querying dates fast?

Posted by Élie Roux <el...@telecom-bretagne.eu>.

P.S.: Here's another aspect of the problem (although in a different
aspect of the code). If I make the same query + filter on dates in
unions, there seems to be no cache mechanism, as the following takes
3.5s instead of 1.7s:

select ?le ?la where {
    {
        ?le adm:logDate ?sdate .
        FILTER(?sdate > "2021-08-20T00:00:00"^^xsd:dateTime)
    } union {
        ?la adm:logDate ?sdate .
        FILTER(?sdate > "2021-08-20T00:00:00"^^xsd:dateTime)
    }
}

perhaps that could be the subject of another issue? Or is it another
case that is too abnormal to be optimized? (In a real-life use case, I
would put two different BGP after each FILTER of course)

Best,
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> This is what was discovered before - the cost of scanning and filtering
> isn't that high, and while outlier cases may be measurably faster, the bulk
> of queries will be only marginally faster. There are always a lot of things
> that can be done; it comes down to contributions and priorities.
>
> And the cost of the join? And of the CONSTRUCT? And, if Fuseki, the HTTP
> costs, which vary from trivial to a lot depending on result sizes, and
> connection caching. The point is "it is complicated" and that means
> however good the point improvement is, it may not have a significant
> overall benefit.
>
> Investigation needed before jumping into implementation.

Well... I don't disagree that it is complicated. I now have a
relatively straightforward query that takes 1.7s even after a few
attempts:

select (count (distinct ?e) as ?count) where {
    {
        ?le adm:logDate ?sdate .
        FILTER(?sdate > "2020-08-20T00:00:00"^^xsd:dateTime)
        ?le a adm:Synced .
        ?va adm:logEntry ?le ;
            adm:adminAbout ?v .
        ?v bdo:volumeOf ?e .
    }
}

I'm reading the 1.7s in the Fuseki logs. Interestingly, if I take a
value in the future for the date and get 0 results, the query still
takes 1.7s, for instance:

select ?le where {
    {
        ?le adm:logDate ?sdate .
        FILTER(?sdate > "2020-10-20T00:00:00"^^xsd:dateTime)
    }
}

So it's a bit hard for me to think the bottleneck could be
elsewhere... what other possible bottleneck should I look at?

Note that this contradicts previous findings where a similar query was
faster (around 300ms) if the indexes were not cold... but oddly I
can't reproduce it anymore, the 1.7s result has been consistent over
many queries in a short period of time, so the indexes were not
cold...

I understand it's a lot of code to write and that it's a big project, sorry.

> ARQ does not use the Model API. It's an extension to the ARQ algebra,
> OpExecutor, and subclasses, and one or more optimization Transforms to
> detect the case in a query.
>
> Overall, this isn't an API issue - it's the cost of implementing that
> API vs not doing something elsewhere.

Ok yes

Best,
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Andy Seaborne <an...@apache.org>.

>> The reality is also that your case seems to be a bit unusual. To be 3.5s
>> I'd guess you are hitting the POS (or quad equivalent) index cold. Or
>> something else is interacting with it (named graphs+union?)
> 
> Yes, I noticed after my initial email that later queries run much
> faster... now it's around 300ms which is much better. Still a bit slow
> but manageable.

This is what was discovered before - the cost of scanning and filtering
isn't that high, and while outlier cases may be measurably faster, the bulk
of queries will be only marginally faster. There are always a lot of things
that can be done; it comes down to contributions and priorities.

And the cost of the join? And of the CONSTRUCT? And, if Fuseki, the HTTP
costs, which vary from trivial to a lot depending on result sizes, and
connection caching. The point is "it is complicated" and that means
however good the point improvement is, it may not have a significant
overall benefit.

Investigation needed before jumping into implementation.

(I don't see how it works with dynamic inference.)

On 30/08/2020 10:46, Élie Roux wrote:
>>> Checking whether there is one first.
>>
>> Ok, I'll do that
> 
> Turns out there's already a 2011 issue about that:
> 
> https://issues.apache.org/jira/browse/JENA-144
> 
> I'm wondering if opening another issue about a request for a new API
> function would be relevant? Something along the lines of what Andy
> proposed:
> 
> StmtIterator listStatements(Resource s, Property p, RDFNode omin, RDFNode omax)
> 
> as well as perhaps a new
> 
> Interface RangedObjectSelector

ARQ does not use the Model API. It's an extension to the ARQ algebra, 
OpExecutor, and subclasses, and one or more optimization Transforms to 
detect the case in a query.

Overall, this isn't an API issue - it's the cost of implementing that 
API vs not doing something elsewhere.

     Andy

> 
> ?
> 
> Best,
>

Re: shortcut for querying dates fast?

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> > Checking whether there is one first.
>
> Ok, I'll do that

Turns out there's already a 2011 issue about that:

https://issues.apache.org/jira/browse/JENA-144

I'm wondering if opening another issue about a request for a new API
function would be relevant? Something along the lines of what Andy
proposed:

StmtIterator listStatements(Resource s, Property p, RDFNode omin, RDFNode omax)

as well as perhaps a new

Interface RangedObjectSelector

?

Best,
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> Checking whether there is one first.

Ok, I'll do that

> The reality is also that your case seems to be a bit unusual. To be 3.5s
> I'd guess you are hitting the POS (or quad equivalent) index cold. Or
> something else is interacting with it (named graphs+union?)

Yes, I noticed after my initial email that later queries run much
faster... now it's around 300ms which is much better. Still a bit slow
but manageable.

> Not everyone will be happy with the compromises necessary - it isn't
> making "normal cases work" (and, arguably, your case is not normal!),

Well... I understand, but I really think having this kind of
optimization (not just this specific one but implementing more of this
kind) would make SPARQL more attractive by making it look more
production-ready.

> it's making other things stop working.
> e.g. The encoding of inline datetimes is only for CE (in fact years 0-7999).

I'm not sure I understand?

Best,
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Andy Seaborne <an...@apache.org>.

On 29/08/2020 12:50, Élie Roux wrote:
> Hi all,
> 
> would opening an issue on JIRA be the right thing to do?
> 
> Best,
> 

Checking whether there is one first.

The reality is also that your case seems to be a bit unusual. To be 3.5s
I'd guess you are hitting the POS (or quad equivalent) index cold. Or 
something else is interacting with it (named graphs+union?)

So profiling and investigation of the system would be a useful input.

Not everyone will be happy with the compromises necessary - it isn't
making "normal cases work" (and, arguably, your case is not normal!), it's
making other things stop working.

e.g. The encoding of inline datetimes is only for CE (in fact years 0-7999).

     Andy

Re: shortcut for querying dates fast?

Posted by Élie Roux <el...@telecom-bretagne.eu>.

Hi all,

would opening an issue on JIRA be the right thing to do?

Best,
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> (in memory or TDB?)

TDB1

> > - are there other ways to write this query to make it more performant?
> Not in ARQ unless there are fewer adm:logEntry triples.

No

> The access to data triples is (S,P,O) where any of S/P/O can be ANY.
> So you have (ANY, adm:logDate, ANY)
>
> Ideally, that would be for TDB:
>
> (ANY, adm:logDate, ANY, start O at "2020-08-01T00:00:00"^^xsd:dateTime)
>
> or generally
>
> (ANY, adm:logDate, ANY, min O, max O)

Yes exactly! That would be a very good solution for this use case

> There are complications with illegal literals, mixed types, and encoding
> restrictions etc., but in TDB2 "2020-08-01T00:00:00"^^xsd:dateTime is
> stored inline in the O slot as binary so the index is partially sorted
> for valid data.
>
> There are precision limits in the encoding for xsd:dateTime:
> only to millisecond accuracy, and timezones must be in units of 15 min (which
> is true for all valid tz at the time of writing).
>
> Invalid terms are not recorded inline. They are recorded faithfully but
> it means the abbreviated range isn't going to see them.

Ok yes, I imagine there are a lot of edge cases indeed... but I'm happy
even if only normal cases work

> > - is it worth reporting this on the SPARQL 1.2 github repo?
>
> It is an implementation issue, not a language design issue.

Ok yes.

Thanks!
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Élie Roux <el...@telecom-bretagne.eu>.

> I'm wondering whether using xsd:long instead of xsd:dateTime, with
> timestamps mapped to milliseconds in numerical form, would perform
> better.

well, I think it's convenient to have date times represented as
xsd:dateTime in RDF... now yes, ideally they could be mapped to long
timestamps in TDB. This wouldn't really change the core issue though,
which is that all the triples are fetched, and only then their object
is compared, instead of doing the comparison at the time of fetch. I
can't really test converting my values to longs though so I can't
really say...
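
For reference, the mapping to long timestamps discussed above could look like the following Python sketch. It is only an illustration: the function name is made up, and it assumes that a value without a timezone should be read as UTC, which xsd:dateTime itself does not guarantee.

```python
from datetime import datetime, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

def xsd_datetime_to_epoch_millis(lexical):
    """Convert an xsd:dateTime lexical form to milliseconds since the Unix epoch.

    Assumption: a value without a timezone is treated as UTC.
    """
    # Older Python versions reject a trailing "Z", so normalise it first.
    if lexical.endswith("Z"):
        lexical = lexical[:-1] + "+00:00"
    parsed = datetime.fromisoformat(lexical)
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    delta = parsed - EPOCH
    # Integer arithmetic avoids the floating-point rounding of timestamp().
    return (delta.days * 86_400 + delta.seconds) * 1000 + delta.microseconds // 1000
```

With these assumptions, "2020-08-01T00:00:00" maps to 1596240000000, and the numeric order of the results matches the chronological order of the inputs, so a FILTER on the long value selects the same rows as the xsd:dateTime one.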

Best,
-- 
Elie

Re: shortcut for querying dates fast?

Posted by Piotr Nowara <pi...@gmail.com>.

Hi,

I'm wondering whether using xsd:long instead of xsd:dateTime, with
timestamps mapped to milliseconds in numerical form, would perform
better.

Best,
Piotr

On Fri, 28 Aug 2020 at 13:15, Andy Seaborne <an...@apache.org> wrote:

>
>
> On 27/08/2020 09:10, Élie Roux wrote:
> > Dear all,
> >
> > I have a dataset with (among other things) about 400,000 triples in the
> form
>
> (in memory or TDB?)
>
> >
> > ?a adm:logDate ?d
> >
> > where ?d is an xsd:dateTime. I'm writing a query to get all the
> > triples that have a ?d in a certain interval. There are usually very
> > few of them (around say 200). I'm writing a query that looks like
> >
> > construct {
> >      ?va  adm:hasactivityon ?d .
> > } where {
> >      ?le adm:logDate ?d .
> >       FILTER(?d > "2020-08-01T00:00:00"^^xsd:dateTime)
> >      ?va adm:logEntry ?le .
> > }
> >
> > But it's too slow for our purpose (3.5s). I suspect it's conceptually
> > simple to have a very performant implementation (using an index
> > dedicated to xsd:dateTime literals that could be queried), but I also
> > suspect SPARQL doesn't make that kind of performant algorithm easy to
> > summon in such a query (which is a mix of a bgp and a filter instead
> > of a direct call to a performant index).
> >
> > So a few questions:
> > - are there other ways to write this query to make it more performant?
> Not in ARQ unless there are fewer adm:logEntry triples.
>
> > - my impression is that what I want with time is similar to what
> > GeoSPARQL provides for space... is there something similar to
> > GeoSPARQL for time?
> > - would that kind of performant index require the same type of
> > mechanism as the jena:text extension?
>
> The access to data triples is (S,P,O) where any of S/P/O can be ANY.
> So you have (ANY, adm:logDate, ANY)
>
> Ideally, that would be for TDB:
>
> (ANY, adm:logDate, ANY, start O at "2020-08-01T00:00:00"^^xsd:dateTime)
>
> or generally
>
> (ANY, adm:logDate, ANY, min O, max O)
>
> There are complications with illegal literals, mixed types, and encoding
> restrictions etc., but in TDB2 "2020-08-01T00:00:00"^^xsd:dateTime is
> stored inline in the O slot as binary so the index is partially sorted
> for valid data.
>
> There are precision limits in the encoding for xsd:dateTime:
> only to millisecond accuracy, and timezones must be in units of 15 min (which
> is true for all valid tz at the time of writing).
>
> Invalid terms are not recorded inline. They are recorded faithfully but
> it means the abbreviated range isn't going to see them.
>
> > - is it worth reporting this on the SPARQL 1.2 github repo?
>
> It is an implementation issue, not a language design issue.
>
> >
> > Thanks in advance,
> >
>
>      Andy
>

Re: shortcut for querying dates fast?

Posted by Andy Seaborne <an...@apache.org>.

On 27/08/2020 09:10, Élie Roux wrote:
> Dear all,
> 
> I have a dataset with (among other things) about 400,000 triples in the form

(in memory or TDB?)

> 
> ?a adm:logDate ?d
> 
> where ?d is an xsd:dateTime. I'm writing a query to get all the
> triples that have a ?d in a certain interval. There are usually very
> few of them (around say 200). I'm writing a query that looks like
> 
> construct {
>      ?va  adm:hasactivityon ?d .
> } where {
>      ?le adm:logDate ?d .
>       FILTER(?d > "2020-08-01T00:00:00"^^xsd:dateTime)
>      ?va adm:logEntry ?le .
> }
> 
> But it's too slow for our purpose (3.5s). I suspect it's conceptually
> simple to have a very performant implementation (using an index
> dedicated to xsd:dateTime literals that could be queried), but I also
> suspect SPARQL doesn't make that kind of performant algorithm easy to
> summon in such a query (which is a mix of a bgp and a filter instead
> of a direct call to a performant index).
> 
> So a few questions:
> - are there other ways to write this query to make it more performant?
Not in ARQ unless there are fewer adm:logEntry triples.

> - my impression is that what I want with time is similar to what
> GeoSPARQL provides for space... is there something similar to
> GeoSPARQL for time?
> - would that kind of performant index require the same type of
> mechanism as the jena:text extension?

The access to data triples is (S,P,O) where any of S/P/O can be ANY.
So you have (ANY, adm:logDate, ANY)

Ideally, that would be for TDB:

(ANY, adm:logDate, ANY, start O at "2020-08-01T00:00:00"^^xsd:dateTime)

or generally

(ANY, adm:logDate, ANY, min O, max O)

There are complications with illegal literals, mixed types, and encoding
restrictions etc., but in TDB2 "2020-08-01T00:00:00"^^xsd:dateTime is
stored inline in the O slot as binary so the index is partially sorted
for valid data.

There are precision limits in the encoding for xsd:dateTime:
only to millisecond accuracy, and timezones must be in units of 15 min (which
is true for all valid tz at the time of writing).

Invalid terms are not recorded inline. They are recorded faithfully but 
it means the abbreviated range isn't going to see them.
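
To make the difference between the two access patterns concrete, here is a minimal Python sketch. It is not Jena or TDB code: it assumes a toy in-memory index for one predicate, kept sorted by object value, and the sample data and function names are invented for illustration. The first function mimics (ANY, adm:logDate, ANY) followed by a FILTER, the second mimics (ANY, adm:logDate, ANY, start O at ...).

```python
from bisect import bisect_right
from datetime import datetime

# Hypothetical sample data: (subject, xsd:dateTime lexical form, no timezone).
SAMPLE = [
    ("le1", "2019-03-12T10:00:00"),
    ("le2", "2020-07-30T08:15:00"),
    ("le3", "2020-08-02T09:30:00"),
    ("le4", "2020-08-21T17:45:00"),
]

# Toy index for the single predicate adm:logDate, sorted by object value.
ENTRIES = sorted((datetime.fromisoformat(date), subj) for subj, date in SAMPLE)
KEYS = [obj for obj, _ in ENTRIES]  # built once, together with the index

def scan_and_filter(start):
    """Read every entry, then keep those with object > start."""
    touched = 0
    matches = []
    for obj, subj in ENTRIES:
        touched += 1
        if obj > start:
            matches.append(subj)
    return matches, touched

def range_scan(start):
    """Binary-search the first object > start, then read only the tail."""
    first = bisect_right(KEYS, start)
    tail = ENTRIES[first:]
    return [subj for _, subj in tail], len(tail)
```

Both functions return the matching subjects together with the number of index entries they had to read. With 400,000 entries and about 200 matches, the first reads all 400,000 while the second reads roughly the 200 that match, plus a logarithmic number of comparisons for the search.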

> - is it worth reporting this on the SPARQL 1.2 github repo?

It is an implementation issue, not a language design issue.

> 
> Thanks in advance,
> 

     Andy