You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by "Ozga, Rafal" <r....@nature.com> on 2012/01/31 13:49:47 UTC

Working with Sparql Algebra

Hi,

We¹re trying to build a caching layer in the front of our triplestore
server. But to get the most out of such a cache, it¹s
quite important to normalize incoming SPARQL queries so that instead of
directly hitting the backend triplestore for the query:

select * where {?subject ?predicate ?object}

we could reuse the results already cached for the previous query, say:

select * where {?s ?p ?o}

as despite of different variable names those queries really express the very
same information need¹ if you like.

We thought that the best way of doing so would be to use the feature that
ARQ has in store, namely SPARQL Algebra.
The problem with that approach is, however, that the algebras being produced
by Algebra.compile(query) routine are not ³abstract enough²
for our purpose. For the former queries we would get respectively:

(bgp (triple ?subject ?predicate ?object))

(bgp (triple ?s ?p ?o))


So the question is whether we are missing something and there is indeed a
general way of obtaining the algebras that wouldn¹t be any different for
semantically identical queries (like those two above) or we have to resort
to some built in-house tweaks and tricks (e.g. normalizing variable names,
sorting algebra nodes on the same tree level alphabetically etc. etc.) ?

Kind regards,

Rafal Ozga 



********************************************************************************   
DISCLAIMER: This e-mail is confidential and should not be used by anyone who is
not the original intended recipient. If you have received this e-mail in error
please inform the sender and delete it from your mailbox or any other storage
mechanism. Neither Macmillan Publishers Limited nor any of its agents accept
liability for any statements made which are clearly the sender's own and not
expressly made on behalf of Macmillan Publishers Limited or one of its agents.
Please note that neither Macmillan Publishers Limited nor any of its agents
accept any responsibility for viruses that may be contained in this e-mail or
its attachments and it is your responsibility to scan the e-mail and 
attachments (if any). No contracts may be concluded on behalf of Macmillan 
Publishers Limited or its agents by means of e-mail communication. Macmillan 
Publishers Limited Registered in England and Wales with registered number 785998 
Registered Office Brunel Road, Houndmills, Basingstoke RG21 6XS   
********************************************************************************

Re: Working with Sparql Algebra

Posted by Bill Roberts <bi...@swirrl.com>.

> 
> 
>> We¹re trying to build a caching layer in the front of our
>> triplestore server. But to get the most out of such a cache, it¹s 
>> quite important to normalize incoming SPARQL queries...
> 
> Indeed, although don't underestimate the power of a very simple minded
> cache. In some of my work we did no normalisation and it was still
> very effective.
> 

Like Damian, my experience has been that caching without normalisation can be very effective - and of course it's very simple to implement.

Re: Working with Sparql Algebra

Posted by Paolo Castagna <ca...@googlemail.com>.

Andy Seaborne wrote:
>> There is a (prototype) SPARQL cache that does paging by stripping off
>> LIMIT/OFFSET, issuing the whole (sorted) query and caching the results.
>>  Then, later requests with different LIMIT/OFFSET are sliced out of the
>> cached results.

A couple of links for Rafal:

  // The query cache
  val queryCacheMap = MMap[Query, RS]()

https://github.com/afs/LD-Access/blob/master/src/main/java/org/apache/jena/sparqlcache/SparqlCache.scala
https://github.com/afs/LD-Access/blob/master/src/main/java/org/apache/jena/sparqlcache/RS.scala

So, the cache is a map from Query objects to RS (which do slicing when offset
and/or limit are specified).

Rafal, are you using Java to implement your cache layer?
Is/will it be open source?

Paolo

PS:
You find also another small experiment here:
https://github.com/castagna/sparql-cache
... not as clever as LD-Access.

Re: Working with Sparql Algebra

Posted by Paolo Castagna <ca...@googlemail.com>.

Hi Rafal,
you have already received many good suggestions.

While ago I had a pretty similar problem, my use case was slightly different
though: query log analysis... but I had the problem to find "equivalent"
queries. Here is the thread: http://markmail.org/thread/vuvd2g7p2yqxtx52
from jena-users. You might find it useful.

... and, I can only agree with all the +1 around how useful can be a caching
layer, even if naively implemented. ;-)

Please, do keep us updated on your progress.

My 2 cents,
Paolo

Re: Working with Sparql Algebra

Posted by Andy Seaborne <an...@apache.org>.

On 31/01/12 14:50, Ozga, Rafal wrote:
> Fair enough, the single comparison is indeed in O(size(algebra_expression))
> but then doing it naively I would have to compare my tree with all the
> already existing ones in the cache. And that's not really an option I think.
> Unless there is some way of ordering the algebras so I could compare mine
> with like logarithm or so of the number of already existing algebras in the
> cache ?
>
> But I guess there isn't any obvious way of doing so, so it seems that the
> option for me is to implement an OpVisitor and traverse the algebra tree
> modifying the subsequent nodes one by one, isn't  ?
>
> Speaking of NodeIsomorphismMap. How do you make the bgp (triple ?s ?p ?o)
> isomorphic with the infinite number of patterns bgp (triple ?s' ?p' ?o')
> that are different only by variable names ?

You can do better than that :-)

Approach 1:

Canonicalize patterns so the variables are allocated ?a1 ?a2 as 
encounter in the walk of the algebra expression.  This canonical pattern 
can be used in a normal hash table.  It's better to execute the 
canonical form - makes rewriting the results easier.

Approach 2:

A special hash function where the hash function is var name insensitive 
(position sensitive would be good but just missing vars will work) which 
will bucket the incoming.  Test within the bucket is still as you 
describe but (hopefully!) much smaller set of queries.  Just the way 
predicate URIs are used is probably fairly distinguishing.

If you produce either isomorphic-by-var-bNode or the canonical form 
code, please consider contributing it to Jena.

	Andy

Re: Working with Sparql Algebra

Posted by "Ozga, Rafal" <r....@nature.com>.

Fair enough, the single comparison is indeed in O(size(algebra_expression))
but then doing it naively I would have to compare my tree with all the
already existing ones in the cache. And that's not really an option I think.
Unless there is some way of ordering the algebras so I could compare mine
with like logarithm or so of the number of already existing algebras in the
cache ? 

But I guess there isn't any obvious way of doing so, so it seems that the
option for me is to implement an OpVisitor and traverse the algebra tree
modifying the subsequent nodes one by one, isn't  ?

Speaking of NodeIsomorphismMap. How do you make the bgp (triple ?s ?p ?o)
isomorphic with the infinite number of patterns bgp (triple ?s' ?p' ?o')
that are different only by variable names ?




>>> So the question is whether we are missing something and there is
>>> indeed a general way of obtaining the algebras that wouldn¹t be any
>>> different for semantically identical queries (like those two above)
>>> or we have to resort to some built in-house tweaks and tricks (e.g.
>>> normalizing variable names, sorting algebra nodes on the same tree
>>> level alphabetically etc. etc.) ?
>> 
>> There might well be transformers doing some of this work (ISTR a
>> variable reassignment mechanism) -- Andy would know best. I wrote a
>> little piece on manipulating sparql [1] which shows a simple transformer.
> 
> Yes.
> 
> Every Op has an .equalTo method.  This is not .equals().
> 
>    Op.equalTo(Op other, NodeIsomorphismMap labelMap)
> 
> 
> Op.equals is implemented as a call of .equalTo with a bNode mapping
> isomorphism map.
> 
> You could write your own NodeIsomorphismMap (inherit) and change it to
> be variable name mapping as well.  Then you would get the isomorphic
> equality test you need to do this.
> 
> If there is an isomorphism mapping, then the code will return true and
> what is more, the map will give the association of names so you can
> remap the variables in the result as well.
> 
> The algorithm is linear in the size of the algebra expression because
> expressions are trees so it's walking two trees to see if one can be
> isomorphically mapped to the other.
> 
> Andy
> 
>


********************************************************************************   
DISCLAIMER: This e-mail is confidential and should not be used by anyone who is
not the original intended recipient. If you have received this e-mail in error
please inform the sender and delete it from your mailbox or any other storage
mechanism. Neither Macmillan Publishers Limited nor any of its agents accept
liability for any statements made which are clearly the sender's own and not
expressly made on behalf of Macmillan Publishers Limited or one of its agents.
Please note that neither Macmillan Publishers Limited nor any of its agents
accept any responsibility for viruses that may be contained in this e-mail or
its attachments and it is your responsibility to scan the e-mail and 
attachments (if any). No contracts may be concluded on behalf of Macmillan 
Publishers Limited or its agents by means of e-mail communication. Macmillan 
Publishers Limited Registered in England and Wales with registered number 785998 
Registered Office Brunel Road, Houndmills, Basingstoke RG21 6XS   
********************************************************************************

Re: Working with Sparql Algebra

Posted by Andy Seaborne <an...@apache.org>.

On 31/01/12 13:10, Damian Steer wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 31/01/12 12:49, Ozga, Rafal wrote:
>> Hi,
>
> Hi Rafal,
>
>> We¹re trying to build a caching layer in the front of our
>> triplestore server. But to get the most out of such a cache, it¹s
>> quite important to normalize incoming SPARQL queries...
>
> Indeed, although don't underestimate the power of a very simple minded
> cache. In some of my work we did no normalisation and it was still
> very effective.

+1

It can be surprising just how many identical queries arrive.

There is a (prototype) SPARQL cache that does paging by stripping off 
LIMIT/OFFSET, issuing the whole (sorted) query and caching the results. 
  Then, later requests with different LIMIT/OFFSET are sliced out of the 
cached results.

>> We thought that the best way of doing so would be to use the
>> feature that ARQ has in store, namely SPARQL Algebra.
>
> +1
>
>> The problem with that approach is, however, that the algebras being
>> produced by Algebra.compile(query) routine are not ³abstract
>> enough² for our purpose. For the former queries we would get
>> respectively:
>>
>> (bgp (triple ?subject ?predicate ?object))
>>
>> (bgp (triple ?s ?p ?o))
>
> Yes, Algebra.compile in itself does no work on the form of the
> operations as far as I'm aware. However that's only the beginning of
> the process. ARQ (and individual query engines) spend a lot of their
> time transforming algebras.

Caveat: those are not quite the same query unless something else is 
going on.

If there are SELECT queries, then the variable names of the column are 
different.

If it's a CONSTRUCT or DESCRIBE query, then, depending on the template 
for CONSTRUCT, they can be the same.

>
>> So the question is whether we are missing something and there is
>> indeed a general way of obtaining the algebras that wouldn¹t be any
>> different for semantically identical queries (like those two above)
>> or we have to resort to some built in-house tweaks and tricks (e.g.
>> normalizing variable names, sorting algebra nodes on the same tree
>> level alphabetically etc. etc.) ?
>
> There might well be transformers doing some of this work (ISTR a
> variable reassignment mechanism) -- Andy would know best. I wrote a
> little piece on manipulating sparql [1] which shows a simple transformer.

Yes.

Every Op has an .equalTo method.  This is not .equals().

   Op.equalTo(Op other, NodeIsomorphismMap labelMap)

Op.equals is implemented as a call of .equalTo with a bNode mapping 
isomorphism map.

You could write your own NodeIsomorphismMap (inherit) and change it to 
be variable name mapping as well.  Then you would get the isomorphic 
equality test you need to do this.

If there is an isomorphism mapping, then the code will return true and 
what is more, the map will give the association of names so you can 
remap the variables in the result as well.

The algorithm is linear in the size of the algebra expression because 
expressions are trees so it's walking two trees to see if one can be 
isomorphically mapped to the other.

	Andy

>
> Damian
>
> [1]
> <http://incubator.apache.org/jena/documentation/query/manipulating_sparql_using_arq.html>
> -----BEGIN PGP SIGNATURE-----
> Version: GnuPG v1.4.11 (GNU/Linux)
> Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/
>
> iEYEARECAAYFAk8n6FAACgkQAyLCB+mTtylfgwCg8td99GDeryaPqrJpIX2PTzhQ
> lcYAoPGSMFc038apOpz1bSjbGlOs+pbk
> =Uvwc
> -----END PGP SIGNATURE-----

Re: Working with Sparql Algebra

Posted by Damian Steer <d....@bristol.ac.uk>.

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 31/01/12 12:49, Ozga, Rafal wrote:
> Hi,

Hi Rafal,

> We¹re trying to build a caching layer in the front of our
> triplestore server. But to get the most out of such a cache, it¹s 
> quite important to normalize incoming SPARQL queries...

Indeed, although don't underestimate the power of a very simple minded
cache. In some of my work we did no normalisation and it was still
very effective.

> We thought that the best way of doing so would be to use the
> feature that ARQ has in store, namely SPARQL Algebra.

+1

> The problem with that approach is, however, that the algebras being
> produced by Algebra.compile(query) routine are not ³abstract
> enough² for our purpose. For the former queries we would get
> respectively:
> 
> (bgp (triple ?subject ?predicate ?object))
> 
> (bgp (triple ?s ?p ?o))

Yes, Algebra.compile in itself does no work on the form of the
operations as far as I'm aware. However that's only the beginning of
the process. ARQ (and individual query engines) spend a lot of their
time transforming algebras.

> So the question is whether we are missing something and there is
> indeed a general way of obtaining the algebras that wouldn¹t be any
> different for semantically identical queries (like those two above)
> or we have to resort to some built in-house tweaks and tricks (e.g.
> normalizing variable names, sorting algebra nodes on the same tree
> level alphabetically etc. etc.) ?

There might well be transformers doing some of this work (ISTR a
variable reassignment mechanism) -- Andy would know best. I wrote a
little piece on manipulating sparql [1] which shows a simple transformer.

Damian

[1]
<http://incubator.apache.org/jena/documentation/query/manipulating_sparql_using_arq.html>
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.11 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iEYEARECAAYFAk8n6FAACgkQAyLCB+mTtylfgwCg8td99GDeryaPqrJpIX2PTzhQ
lcYAoPGSMFc038apOpz1bSjbGlOs+pbk
=Uvwc
-----END PGP SIGNATURE-----