You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Andy Seaborne <an...@epimorphics.com> on 2011/03/28 10:07:43 UTC

Re: TDB and FILTER (?start > "2010-03-27T00:00:00Z"^^xsd:dateTime) && (?end < "2010-03-28T00:00:00Z"^^xsd:dateTime)


On 28/03/11 08:49, Paolo Castagna wrote:
> Hi,
> I sometimes see queries like this one:
>
> SELECT *
> WHERE {
> ...
> ?event :start ?start .
> ?event :end ?end .
> FILTER (
> (?start > "2010-03-27T00:00:00Z"^^xsd:dateTime) &&
> (?end < "2010-03-28T00:00:00Z"^^xsd:dateTime)
> )
> }
>
> The FILTER is scanning the indexes and the node table to find the values
> which satisfy the filter expression.
>
> With TDB, if you have a large dataset, this query can be slow.
>
> TDB already encodes certain node values (including DateTime) inline in
> the node ids and the good news is that the encoding scheme preserves the
> order.
>
> See NodeId's inline(Node node) method:
> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src/main/java/com/hp/hpl/jena/tdb/store/NodeId.java
>
> And, DateTimeNode, for example:
> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src/main/java/com/hp/hpl/jena/tdb/store/DateTimeNode.java
>
>
> However, I am not sure these in line node ids are at the moment used at
> query time. Am I right?

The dataTime value are rebuilt directly from the bits stored in the 
NodeId,  There isn't an entry in the node table for value directly 
encoded.  It uses unpackDateTime

The custom "NumberUtils.formatInt" stuff is because it's appreciably 
faster than the standard java operations for paring integers which are 
locale sensitive but that isn't needed here.

Inlcing into Nodeids is added by NodeTableInlin and it calls 
NodeId.inline(node) ;

> This is probably not a trivial change, but one worth aiming at.

What could be done is add better range index scans that started at a 
particular value, and not look for "any" in that slot.  However, the 
biggest benefit is not hitting the NodeTable at all which is already there.

> In theory, it should speed up this kind of FILTER expressions and TDB
> will be able to answer certain queries without touching the node table
> or scanning a large portion of your data just to find a few values.
>
> Another very similar use case is with queries involving locations (i.e.
> latitude and longitude). Sometimes you want to find things within a
> bounded box, therefore you have a similar expression for latitude values
> and one for longitude values.
>
> Is it worth opening a JIRA issue (i.e. a feature request) for this?
>
> Paolo

	Andy

Re: TDB and FILTER (?start > "2010-03-27T00:00:00Z"^^xsd:dateTime) && (?end < "2010-03-28T00:00:00Z"^^xsd:dateTime)

Posted by Paolo Castagna <ca...@googlemail.com>.

Andy Seaborne wrote:
> 
>>> What could be done is add better range index scans that started at a
>>> particular value, and not look for "any" in that slot. However, the
>>> biggest benefit is not hitting the NodeTable at all which is already
>>> there.
>>
>> Good.
>>
>> So, if I have a TDB store full of events, we currently scan the entire
>> SPO index while we could scan just the fraction with values which satisfy
>> the FILTER expression.
> 
> More likely to look in POS, for fixed P, so it's a partial scan.
> 
> It'll look for [(P,0,0), (P+1,0,0)) fro P is the NodeId

I see, thanks.

>> If I have a large TDB store full of "events" this could still speed up
>> those queries, correct?
> 
> Yes, for a databse full such events.  In the case you describe then it 
> might be beneficial to do
> 
> [(P,X1,0), (P,X2,0))
> 
> where X1 and X2 are the Nodeid encodings for the limits.  It would need 
> a lot of data before that's going to be significant though - and also it 
> wil depend on whether it's better to do the rest of the BGP or the 
> constrained range scan.

Ack.

Another dirty workaround is adding more data to make the query more selective
(by YEAR, MONTH, DAY for example).

Paolo

>> I'll have a look (and come back with a description of how I understand
>> what could be done here).
> 
>>
>> Thanks,
>> Paolo
> 
>     Andy

Re: TDB and FILTER (?start > "2010-03-27T00:00:00Z"^^xsd:dateTime) && (?end < "2010-03-28T00:00:00Z"^^xsd:dateTime)

Posted by Andy Seaborne <an...@epimorphics.com>.

>> What could be done is add better range index scans that started at a
>> particular value, and not look for "any" in that slot. However, the
>> biggest benefit is not hitting the NodeTable at all which is already
>> there.
>
> Good.
>
> So, if I have a TDB store full of events, we currently scan the entire
> SPO index while we could scan just the fraction with values which satisfy
> the FILTER expression.

More likely to look in POS, for fixed P, so it's a partial scan.

It'll look for [(P,0,0), (P+1,0,0)) fro P is the NodeId

> If I have a large TDB store full of "events" this could still speed up
> those queries, correct?

Yes, for a databse full such events.  In the case you describe then it 
might be beneficial to do

[(P,X1,0), (P,X2,0))

where X1 and X2 are the Nodeid encodings for the limits.  It would need 
a lot of data before that's going to be significant though - and also it 
wil depend on whether it's better to do the rest of the BGP or the 
constrained range scan.

> I'll have a look (and come back with a description of how I understand
> what could be done here).

>
> Thanks,
> Paolo

	Andy

Re: TDB and FILTER (?start > "2010-03-27T00:00:00Z"^^xsd:dateTime) && (?end < "2010-03-28T00:00:00Z"^^xsd:dateTime)

Posted by Paolo Castagna <ca...@googlemail.com>.

Andy Seaborne wrote:
> 
> 
> On 28/03/11 08:49, Paolo Castagna wrote:
>> Hi,
>> I sometimes see queries like this one:
>>
>> SELECT *
>> WHERE {
>> ...
>> ?event :start ?start .
>> ?event :end ?end .
>> FILTER (
>> (?start > "2010-03-27T00:00:00Z"^^xsd:dateTime) &&
>> (?end < "2010-03-28T00:00:00Z"^^xsd:dateTime)
>> )
>> }
>>
>> The FILTER is scanning the indexes and the node table to find the values
>> which satisfy the filter expression.
>>
>> With TDB, if you have a large dataset, this query can be slow.
>>
>> TDB already encodes certain node values (including DateTime) inline in
>> the node ids and the good news is that the encoding scheme preserves the
>> order.
>>
>> See NodeId's inline(Node node) method:
>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src/main/java/com/hp/hpl/jena/tdb/store/NodeId.java 
>>
>>
>> And, DateTimeNode, for example:
>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src/main/java/com/hp/hpl/jena/tdb/store/DateTimeNode.java 
>>
>>
>>
>> However, I am not sure these in line node ids are at the moment used at
>> query time. Am I right?
> 
> The dataTime value are rebuilt directly from the bits stored in the 
> NodeId,  There isn't an entry in the node table for value directly 
> encoded.  It uses unpackDateTime
> 
> The custom "NumberUtils.formatInt" stuff is because it's appreciably 
> faster than the standard java operations for paring integers which are 
> locale sensitive but that isn't needed here.
> 
> Inlcing into Nodeids is added by NodeTableInlin and it calls 
> NodeId.inline(node) ;

Ok, I was missing something here, then.

>> This is probably not a trivial change, but one worth aiming at.
> 
> What could be done is add better range index scans that started at a 
> particular value, and not look for "any" in that slot.  However, the 
> biggest benefit is not hitting the NodeTable at all which is already there.

Good.

So, if I have a TDB store full of events, we currently scan the entire
SPO index while we could scan just the fraction with values which satisfy
the FILTER expression.

If I have a large TDB store full of "events" this could still speed up
those queries, correct?

I'll have a look (and come back with a description of how I understand
what could be done here).

Thanks,
Paolo

>> In theory, it should speed up this kind of FILTER expressions and TDB
>> will be able to answer certain queries without touching the node table
>> or scanning a large portion of your data just to find a few values.
>>
>> Another very similar use case is with queries involving locations (i.e.
>> latitude and longitude). Sometimes you want to find things within a
>> bounded box, therefore you have a similar expression for latitude values
>> and one for longitude values.
>>
>> Is it worth opening a JIRA issue (i.e. a feature request) for this?
>>
>> Paolo
> 
>     Andy