You are viewing a plain text version of this content. The canonical link for it is here.

Posted to users@jena.apache.org by Paolo Castagna <ca...@googlemail.com> on 2011/03/11 13:20:39 UTC

SPARQL query equality and "equivalence"...

Hi,
let's say I have two SPARQL query strings:

---- Q1 ----
PREFIX : <http://example/>

SELECT ?v
{ :x :p ?v . FILTER(?v = 1) }
------------

---- Q2 ----
PREFIX : <http://example/>

SELECT ?v
{
     :x :p ?v .
     FILTER ( ?v = 1 )
}
------------

Once the queries are parsed into two Query objects (say, q1 and q2),
I can use q1.equals(q2) to check if two queries are equal (i.e. exactly
the same, modulo extra spaces/formatting). Great.

However, let's look at another query string:

---- Q3 ----
PREFIX : <http://example/>

SELECT ?y
{
     :x :p ?y .
     FILTER ( ?y = 1 )
}
------------

The only difference between Q3 and Q1|Q2 is that we have a different
variable name (i.e. ?y instead of ?v). The queries are different, but
they are "equivalent": they are structurally the same, only variable
names are different.

Is there already in ARQ a way to compare queries to see if they are
equivalent or not?

I found the equiv method in BasicPattern, is there something similar
for entire queries?

Thanks,
Paolo

Re: SPARQL query equality and "equivalence"...

Posted by Paolo Castagna <ca...@googlemail.com>.


Andy Seaborne wrote:
> Both Query and Op implement .equals(Object) and .hashCode() based on 
> structural equivalence.  Query.equal is older (pre algebra!) - hence 
> having two.
> 
> 
> 
> The contract is that two queries being .equal will behave the same - 
> evaluate to the same results on any possible data and print the same.
> 
> Currently, it's structural equivalance + isomorphic equivalnce based on 
> mapping bNodes.  Renaming bnodes consistently does not change the 
> answers because bNodes-as-variables can't be seen in the results.
> 
> 
> hich is good, because the bNodes are probably named differently if they 
> are parser-generated anyway.
> 
> 
> Q1 and Q3 are not equal by that definition.  The column in the results 
> has been renamed from "v" to "y".

Yep.

> You could add that as a "IsomorphicByVaribleNames(query1, query2)" - but 
> as .equals() as it violates Javas rules for .equals().

Ok.

> The isomorphism is, luckily an easy case with no backtracking needed. 
> Just allocate an isomorphism mapping (it's carried around by 
> NodeIsomorphismMap for bNode labels - they really will be variables 
> called ??0 etc.) whenever an unseen thing is seen.  All the 
> structuralisms that might need backtracking would make queries with 
> different print forms.

Thank you Andy, I'll try this.

Paolo

> 
>     Andy
> 
> 
> On 11/03/11 12:59, Damian Steer wrote:
>>
>> On 11 Mar 2011, at 12:30, Damian Steer wrote:
>>>
>>>> The only difference between Q3 and Q1|Q2 is that we have a different
>>>> variable name (i.e. ?y instead of ?v). The queries are different, but
>>>> they are "equivalent": they are structurally the same, only variable
>>>> names are different.
>>>
>>> I don't think it handles this (but I may be wrong).
>>
>> Ah, Op#equalTo(Op other, NodeIsomorphismMap labelMap) would do that, 
>> once you have the mapping.
>>
>> Damian
>>

Re: SPARQL query equality and "equivalence"...

Posted by Andy Seaborne <an...@epimorphics.com>.

Both Query and Op implement .equals(Object) and .hashCode() based on 
structural equivalence.  Query.equal is older (pre algebra!) - hence 
having two.

The contract is that two queries being .equal will behave the same - 
evaluate to the same results on any possible data and print the same.

Currently, it's structural equivalance + isomorphic equivalnce based on 
mapping bNodes.  Renaming bnodes consistently does not change the 
answers because bNodes-as-variables can't be seen in the results.

hich is good, because the bNodes are probably named differently if they 
are parser-generated anyway.

Q1 and Q3 are not equal by that definition.  The column in the results 
has been renamed from "v" to "y".

You could add that as a "IsomorphicByVaribleNames(query1, query2)" - but 
as .equals() as it violates Javas rules for .equals().

The isomorphism is, luckily an easy case with no backtracking needed. 
Just allocate an isomorphism mapping (it's carried around by 
NodeIsomorphismMap for bNode labels - they really will be variables 
called ??0 etc.) whenever an unseen thing is seen.  All the 
structuralisms that might need backtracking would make queries with 
different print forms.

	Andy

On 11/03/11 12:59, Damian Steer wrote:
>
> On 11 Mar 2011, at 12:30, Damian Steer wrote:
>>
>>> The only difference between Q3 and Q1|Q2 is that we have a different
>>> variable name (i.e. ?y instead of ?v). The queries are different, but
>>> they are "equivalent": they are structurally the same, only variable
>>> names are different.
>>
>> I don't think it handles this (but I may be wrong).
>
> Ah, Op#equalTo(Op other, NodeIsomorphismMap labelMap) would do that, once you have the mapping.
>
> Damian
>

Re: SPARQL query equality and "equivalence"...

Posted by Andy Seaborne <an...@epimorphics.com>.


On 15/03/11 19:39, Paolo Castagna wrote:
> Hi Andy.
>
> Andy Seaborne wrote:
>>
>>> I had a first go at implementing a cached QueryEngineHTTP:
>>> https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/CachedQueryEngineHTTP.java
>>>
>>>
>>> Invalidation is by service end-point and this approach makes sense only
>>> if you have mostly reads with few and non frequent updates.
>>>
>>> I am not happy with all those synchronized and the static cache.
>>
>> synchronized (cache) {
>> if (cache.containsKey(key)) {
>> return (ResultSetRewindable) cache.get(key);
>> }
>> }
>> rs = ResultSetFactory.makeRewindable(super.execSelect());
>> synchronized (cache) {
>> cache.put(key, rs);
>> }
>>
>> The synchronization needs to be over the get and the put together.
>> What if another thread comes in after the first block? Both find .get
>> is null.
>
> This was my first solution, then I tried to be clever...
> I wanted to reduce as much as possible the synchronized part and in
> particular
> I wanted not to have a potentially long operation inside the synchronized
> block. The risk is to have two threads making the query remotely and adding
> the result to the cache.
>
> My fear was that the approach below will reduce concurrency on the client,
> but this is probably not true.

The lock is held for a short time if there is a cache hit and long time 
on a cache miss.  But for the cache miss any thread blocked is waiting 
for this result and would otherwise make the call itself (and slow the 
far end down !).

>
>> You then have two threads executing the makeRewindable/execSelect This
>> happens to be safe because of the operation super.execSelect() is safe
>> in parallel but in general the operation being cached is not thread safe.
>>
>> Generally this is safer: a single sync over get and the
>> put-if-cache-miss. And in this case just as fast, and makes only one
>> call to the far end:
>> (thread 2 waits on the lock, not makes the query on the remote end):
>>
>> synchronized (cache) {
>> if (cache.containsKey(key)) {
>> return (ResultSetRewindable) cache.get(key);
>> rs = ResultSetFactory.makeRewindable(super.execSelect());
>> cache.put(key, rs);
>> }
>
> I'll follow your suggestion.
>
> I also added one which uses Memcached and one which uses Redis as well:
> https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/MemcachedQueryEngineHTTP.java
>
> https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/RedisQueryEngineHTTP.java
>
>
> Cache invalidation is the problem. :-)

:-)

And remote SPARQL endpoints don't manage etags or cache life times very 
well - and can't if random updates appear.

Maybe keep entries for a fixed length of time.

An entry can be checked by asking the server some cheap operation if it 
supports ETags in some useful manner.  If the server provides the same 
etag for all requests between updates, life is OK.  A bit of fixed time 
(to catch very high frequency requests) and

The HTTP request "If-None-Match" (and a 304 response) is also possible 
but it is an extra round trip if the data has changed.

(memo to self: add etags to Fuseki)

> I've tried to follow your advice on implementing a new QueryEngine
> with a factory/implementation pair, but I got confused... it was
> late and this is somehow "out-of-band" activity.
>
> However, I would like to have a good/proper example to show how you
> can have a cached query engine either local or remote.
>
> Thanks,
> Paolo
>
>>
>> Andy

Re: SPARQL query equality and "equivalence"...

Posted by Paolo Castagna <ca...@googlemail.com>.

Hi Andy.

Andy Seaborne wrote:
> 
>> I had a first go at implementing a cached QueryEngineHTTP:
>> https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/CachedQueryEngineHTTP.java 
>>
>>
>> Invalidation is by service end-point and this approach makes sense only
>> if you have mostly reads with few and non frequent updates.
>>
>> I am not happy with all those synchronized and the static cache.
> 
>   synchronized (cache) {
>      if (cache.containsKey(key)) {
>            return (ResultSetRewindable) cache.get(key);
>       }
>     }
>     rs = ResultSetFactory.makeRewindable(super.execSelect());
>       synchronized (cache) {
>          cache.put(key, rs);
>     }
> 
> The synchronization needs to be over the get and the put together.  What 
> if another thread comes in after the first block?  Both find .get is null.

This was my first solution, then I tried to be clever...
I wanted to reduce as much as possible the synchronized part and in particular
I wanted not to have a potentially long operation inside the synchronized
block. The risk is to have two threads making the query remotely and adding
the result to the cache.

My fear was that the approach below will reduce concurrency on the client,
but this is probably not true.

> You then have two threads executing the makeRewindable/execSelect  This 
> happens to be safe because of the operation super.execSelect() is safe 
> in parallel but in general the operation being cached is not thread safe.
> 
> Generally this is safer: a single sync over get and the 
> put-if-cache-miss.  And in this case just as fast, and makes only one 
> call to the far end:
>  (thread 2 waits on the lock, not makes the query on the remote end):
> 
>   synchronized (cache) {
>      if (cache.containsKey(key)) {
>            return (ResultSetRewindable) cache.get(key);
>      rs = ResultSetFactory.makeRewindable(super.execSelect());
>      cache.put(key, rs);
>     }

I'll follow your suggestion.

I also added one which uses Memcached and one which uses Redis as well:
https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/MemcachedQueryEngineHTTP.java
https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/RedisQueryEngineHTTP.java

Cache invalidation is the problem. :-)

I've tried to follow your advice on implementing a new QueryEngine
with a factory/implementation pair, but I got confused... it was
late and this is somehow "out-of-band" activity.

However, I would like to have a good/proper example to show how you
can have a cached query engine either local or remote.

Thanks,
Paolo

> 
>     Andy

Re: SPARQL query equality and "equivalence"...

Posted by Andy Seaborne <an...@epimorphics.com>.

> I had a first go at implementing a cached QueryEngineHTTP:
> https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/CachedQueryEngineHTTP.java
>
> Invalidation is by service end-point and this approach makes sense only
> if you have mostly reads with few and non frequent updates.
>
> I am not happy with all those synchronized and the static cache.

   synchronized (cache) {
      if (cache.containsKey(key)) {
            return (ResultSetRewindable) cache.get(key);
       }
     }
     rs = ResultSetFactory.makeRewindable(super.execSelect());
       synchronized (cache) {
          cache.put(key, rs);
     }

The synchronization needs to be over the get and the put together.  What 
if another thread comes in after the first block?  Both find .get is null.

You then have two threads executing the makeRewindable/execSelect  This 
happens to be safe because of the operation super.execSelect() is safe 
in parallel but in general the operation being cached is not thread safe.

Generally this is safer: a single sync over get and the 
put-if-cache-miss.  And in this case just as fast, and makes only one 
call to the far end:
  (thread 2 waits on the lock, not makes the query on the remote end):

   synchronized (cache) {
      if (cache.containsKey(key)) {
            return (ResultSetRewindable) cache.get(key);
      rs = ResultSetFactory.makeRewindable(super.execSelect());
      cache.put(key, rs);
     }

	Andy

Re: SPARQL query equality and "equivalence"...

Posted by Andy Seaborne <an...@epimorphics.com>.

> I also don't know how I can plug a new QueryEngineHTTP into ARQ, if it's
> possible.

Just on this part (more later):

arq.examples.MyQueryEngine

There's a factory/implementation pair.  The factory checking is a good 
place to see if it's in the query is in the preevaluated results cache.

	Andy

Re: SPARQL query equality and "equivalence"...

Posted by Paolo Castagna <ca...@googlemail.com>.


Andy Seaborne wrote:
> 
> 
> On 11/03/11 15:45, Paolo Castagna wrote:
>>
>>
>> Damian Steer wrote:
>>> On 11 Mar 2011, at 12:30, Damian Steer wrote:
>>>>> The only difference between Q3 and Q1|Q2 is that we have a different
>>>>> variable name (i.e. ?y instead of ?v). The queries are different, but
>>>>> they are "equivalent": they are structurally the same, only variable
>>>>> names are different.
>>>> I don't think it handles this (but I may be wrong).
>>>
>>> Ah, Op#equalTo(Op other, NodeIsomorphismMap labelMap) would do that,
>>> once you have the mapping.
>>
>> Thank you Damian, I'll try this.
>>
>> In the meantime, this is what I came up with (before reading your email):
>>
>> Op op1 = Algebra.compile(q1);
>> Op op3 = Algebra.compile(q3);
>>
>> Item it1 = SSE.parse(op1.toString());
>> it1 = ItemTransformer.transform(new RenameVariablesItemTransform(), it1);
>> Item it3 = SSE.parse(op3.toString());
>> it3 = ItemTransformer.transform(new RenameVariablesItemTransform(), it3);
>>
>> op1 = Algebra.parse(it1);
>> op3 = Algebra.parse(it3);
>>
>> assertEquals(op1, op3);
>>
>>
>>
>> The RenameVariablesItemTransform is:
>>
>> class RenameVariablesItemTransform extends ItemTransformBase {
>>
>> private HashMap<String, String> varNamesMapping = new HashMap<String,
>> String>();
> 
> Map<Var, Var>
> 
> !!!!

Yes.

Now I do:
varsMapping.put((Var) node, Var.alloc("v" + count++));

See RenameVariablesItemTransform.java here:
https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/RenameVariablesItemTransform.java

I had a first go at implementing a cached QueryEngineHTTP:
https://github.com/castagna/sparql-cache/raw/master/src/main/java/com/talis/labs/arq/CachedQueryEngineHTTP.java
Invalidation is by service end-point and this approach makes sense only if you have mostly reads with few and non frequent updates.

I am not happy with all those synchronized and the static cache.

I also don't know how I can plug a new QueryEngineHTTP into ARQ, if it's possible.

>> private int count = 0;
>>
>> @Override
>> public Item transform(Item item, Node node) {
>> if ( node.isVariable() ) {
> 
> Minor:
> Var.isVar(node)

Ack.

Thanks,
Paolo

> 
>> if ( ! varNamesMapping.containsKey(node.toString()) ) {
>> varNamesMapping.put(node.toString(), "v_" + count++);
>> }
>> Var var = Var.alloc(varNamesMapping.get(node.toString()));
>> return Item.createNode(var, item.getLine(), item.getColumn()) ;
>> } else {
>> return Item.createNode(node, item.getLine(), item.getColumn()) ;
>> }
>> }
>>
>> }
>>
>> Not an elegant solution, but it allows me to find (and count)
>> "equivalent" queries
>> in my log files. :-)
>>
>> Paolo
>>
>>
>>>
>>> Damian
>>>

Re: SPARQL query equality and "equivalence"...

Posted by Andy Seaborne <an...@epimorphics.com>.


On 11/03/11 15:45, Paolo Castagna wrote:
>
>
> Damian Steer wrote:
>> On 11 Mar 2011, at 12:30, Damian Steer wrote:
>>>> The only difference between Q3 and Q1|Q2 is that we have a different
>>>> variable name (i.e. ?y instead of ?v). The queries are different, but
>>>> they are "equivalent": they are structurally the same, only variable
>>>> names are different.
>>> I don't think it handles this (but I may be wrong).
>>
>> Ah, Op#equalTo(Op other, NodeIsomorphismMap labelMap) would do that,
>> once you have the mapping.
>
> Thank you Damian, I'll try this.
>
> In the meantime, this is what I came up with (before reading your email):
>
> Op op1 = Algebra.compile(q1);
> Op op3 = Algebra.compile(q3);
>
> Item it1 = SSE.parse(op1.toString());
> it1 = ItemTransformer.transform(new RenameVariablesItemTransform(), it1);
> Item it3 = SSE.parse(op3.toString());
> it3 = ItemTransformer.transform(new RenameVariablesItemTransform(), it3);
>
> op1 = Algebra.parse(it1);
> op3 = Algebra.parse(it3);
>
> assertEquals(op1, op3);
>
>
>
> The RenameVariablesItemTransform is:
>
> class RenameVariablesItemTransform extends ItemTransformBase {
>
> private HashMap<String, String> varNamesMapping = new HashMap<String,
> String>();

Map<Var, Var>

!!!!

> private int count = 0;
>
> @Override
> public Item transform(Item item, Node node) {
> if ( node.isVariable() ) {

Minor:
Var.isVar(node)

> if ( ! varNamesMapping.containsKey(node.toString()) ) {
> varNamesMapping.put(node.toString(), "v_" + count++);
> }
> Var var = Var.alloc(varNamesMapping.get(node.toString()));
> return Item.createNode(var, item.getLine(), item.getColumn()) ;
> } else {
> return Item.createNode(node, item.getLine(), item.getColumn()) ;
> }
> }
>
> }
>
> Not an elegant solution, but it allows me to find (and count)
> "equivalent" queries
> in my log files. :-)
>
> Paolo
>
>
>>
>> Damian
>>

Re: SPARQL query equality and "equivalence"...

Posted by Paolo Castagna <ca...@googlemail.com>.


Damian Steer wrote:
> On 11 Mar 2011, at 12:30, Damian Steer wrote:
>>> The only difference between Q3 and Q1|Q2 is that we have a different
>>> variable name (i.e. ?y instead of ?v). The queries are different, but
>>> they are "equivalent": they are structurally the same, only variable
>>> names are different.
>> I don't think it handles this (but I may be wrong).
> 
> Ah, Op#equalTo(Op other, NodeIsomorphismMap labelMap) would do that, once you have the mapping.

Thank you Damian, I'll try this.

In the meantime, this is what I came up with (before reading your email):

Op op1 = Algebra.compile(q1);
Op op3 = Algebra.compile(q3);

Item it1 = SSE.parse(op1.toString());
it1 = ItemTransformer.transform(new RenameVariablesItemTransform(), it1);
Item it3 = SSE.parse(op3.toString());
it3 = ItemTransformer.transform(new RenameVariablesItemTransform(), it3);
		
op1 = Algebra.parse(it1);
op3 = Algebra.parse(it3);
		
assertEquals(op1, op3);



The RenameVariablesItemTransform is:

class RenameVariablesItemTransform extends ItemTransformBase {

     private HashMap<String, String> varNamesMapping = new HashMap<String, String>();
     private int count = 0;
	
     @Override
     public Item transform(Item item, Node node) {
         if ( node.isVariable() ) {
             if ( ! varNamesMapping.containsKey(node.toString()) ) {
                 varNamesMapping.put(node.toString(), "v_" + count++);
             }
             Var var = Var.alloc(varNamesMapping.get(node.toString()));
             return Item.createNode(var, item.getLine(), item.getColumn()) ;
         } else {
             return Item.createNode(node, item.getLine(), item.getColumn()) ;			
         }
     }

}

Not an elegant solution, but it allows me to find (and count) "equivalent" queries
in my log files. :-)

Paolo


> 
> Damian
>

Re: SPARQL query equality and "equivalence"...

Posted by Damian Steer <d....@bristol.ac.uk>.

On 11 Mar 2011, at 12:30, Damian Steer wrote:
> 
>> The only difference between Q3 and Q1|Q2 is that we have a different
>> variable name (i.e. ?y instead of ?v). The queries are different, but
>> they are "equivalent": they are structurally the same, only variable
>> names are different.
> 
> I don't think it handles this (but I may be wrong).

Ah, Op#equalTo(Op other, NodeIsomorphismMap labelMap) would do that, once you have the mapping.

Damian

Re: SPARQL query equality and "equivalence"...

Posted by Damian Steer <d....@bristol.ac.uk>.

On 11 Mar 2011, at 12:20, Paolo Castagna wrote:

> Once the queries are parsed into two Query objects (say, q1 and q2),
> I can use q1.equals(q2) to check if two queries are equal (i.e. exactly
> the same, modulo extra spaces/formatting).

Quick tip here: use the algebra for comparisons (Algebra.compile(q1).equals(Algebra.... ).
The algebra handles the annoying formatting variance for you.

However...

> The only difference between Q3 and Q1|Q2 is that we have a different
> variable name (i.e. ?y instead of ?v). The queries are different, but
> they are "equivalent": they are structurally the same, only variable
> names are different.

I don't think it handles this (but I may be wrong).

Damian