Posted to users@jena.apache.org by Elli Schwarz <el...@yahoo.com> on 2012/10/23 17:47:34 UTC

Fuseki with Larq - query anomaly


Hello,


I am using Fuseki with Larq (thanks to Osma's recent instructions - thanks Osma!). After adding the Larq dependency, I recompiled Jena at revision 1399877 (this past Friday morning's version of the trunk). I'm noticing the following anomaly when querying the data:

First I insert the following triples:
prefix xsd: <http://www.w3.org/2001/XMLSchema#>
insert data {  graph <urn:test:foo> {
     <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
     <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
     <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
} }

Then I stop Fuseki, delete my index directory, and restart Fuseki. (As an aside, I'd be very interested in a fix for this so I don't have to restart Fuseki to rebuild the index - I'm watching JENA-164 and hoping someone will be able to work on it soon!) Once Fuseki is back up, I run the following query (I have default graph set as the union of named graphs by default):
PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
     <urn:test:s1> ?p ?lit .
     ?lit pf:textMatch "foo" . 
}

and I get 2 results as I expect:

--------------------------------------------------------------------
| p             | lit                                              |
====================================================================
| <urn:test:p1> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
| <urn:test:p2> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
--------------------------------------------------------------------
However, when I flip the order of my query like this:

PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
select * where {
     ?lit pf:textMatch "foo" . 
     <urn:test:s1> ?p ?lit .
}

I get 6 results, instead of the two I expect:

--------------------------------------------------------------------
| lit                                              | p             |
====================================================================
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
| "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
--------------------------------------------------------------------

My guess as to what happens is that in the second query, the query executor first evaluates the first line (the ?lit pf:textMatch "foo"), which returns 3 results, since there are 3 "foo" literals. The next line of the query then sees three bindings for ?lit, so it produces the 6 results above (2 for each "foo" literal, since there are 2 properties for <urn:test:s1>). I know that I can avoid this by using SELECT DISTINCT, but I still think the query shouldn't produce different results when the order is switched. Additionally, if I put this in a CONSTRUCT query, I can't use DISTINCT to eliminate the duplicate results (unless I use a SELECT DISTINCT subquery, which I'd rather avoid).
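
The guess above can be illustrated with a small toy model in Python. It does not use Jena or Larq at all; the function names (text_match, pattern_first, text_match_first) are hypothetical, and the assumption that the text index returns one hit per indexed literal occurrence is only the guess stated above, not confirmed Larq behaviour.

```python
# Toy model of the guess above; it does not use Jena or Larq.
TRIPLES = [
    ("urn:test:s1", "urn:test:p1", "foo"),
    ("urn:test:s1", "urn:test:p2", "foo"),
    ("urn:test:s2", "urn:test:p3", "foo"),
]


def text_match(term):
    """Assumed index behaviour: one hit per indexed literal occurrence."""
    return [o for (_, _, o) in TRIPLES if o == term]


def pattern_first(subject, term):
    # Match the triple pattern first, then use the index only as a filter on ?lit.
    hits = set(text_match(term))
    return [(p, o) for (s, p, o) in TRIPLES if s == subject and o in hits]


def text_match_first(subject, term):
    # Bind ?lit once per index hit, then join every binding with the triple pattern.
    return [
        (lit, p)
        for lit in text_match(term)
        for (s, p, o) in TRIPLES
        if s == subject and o == lit
    ]


print(len(pattern_first("urn:test:s1", "foo")))     # 2 rows
print(len(text_match_first("urn:test:s1", "foo")))  # 6 rows: 3 hits x 2 properties
```

Under this assumption, the six rows contain only two distinct (lit, p) pairs, which is consistent with SELECT DISTINCT hiding the duplicates.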

Another point I've noticed is that in my other (much more complex) queries, against a much larger dataset (~1.5 million triples), if I put the pf:textMatch line anywhere but at the very beginning of the query, the query takes a VERY long time to execute. If I put it as the first line in the query, the query runs quickly. My guess is that the query is executed in order, so the query executor has to do much more work: it first runs the other parts of my query, which produce many results, and then has to go back and essentially filter out those results where the literal doesn't match the pf:textMatch. I can always place the pf:textMatch line first, but then I'm back to the problem mentioned above, where I get too many duplicate results.

Thank you very much for your help!
-Elli

Re: Fuseki with Larq - query anomaly

Posted by Paolo Castagna <ca...@gmail.com>.
Hi Elli

On 23/10/12 16:47, Elli Schwarz wrote:
>
>
> Hello,
>
>
> I am using Fuseki with Larq (thanks to Osma's recent instructions - thanks Osma!)  where I recompiled Jena (after adding the Larq dependency) to Jena revision 1399877 (this past Friday morning's version of the trunk). I'm noticing the following anomaly when querying the data:
>
> First I insert the following triples:
> prefix xsd: <http://www.w3.org/2001/XMLSchema#>
> insert data {  graph <urn:test:foo> {
>       <urn:test:s1> <urn:test:p1> "foo"^^xsd:string .
>       <urn:test:s1> <urn:test:p2> "foo"^^xsd:string .
>       <urn:test:s2> <urn:test:p3> "foo"^^xsd:string .
> } }
>
> Then I stop Fuseki, delete my index directory, and restart Fuseki. (As an aside, I'd be very interested in a fix for this so I don't have to restart Fuseki to rebuild the index - I'm watching JENA-164 and hoping someone will be able to work on it soon!)

Re: JENA-164 ... yeah, I'd love to help you out, but it's a sort of 
architectural issue of Jena IMHO. It should be easier for developers to 
listen to events as triples are added/removed so that you can attach 
external indexes and keep them in sync.

There are multiple paths which you can use to change RDF data: APIs, 
SPARQL, etc. From a user's point of view, you would like to keep your 
external index always in sync, no matter where the updates come from.

> Once Fuseki is back up, I run the following query (I have default graph set as the union of named graphs by default):
> PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
> select * where {
>       <urn:test:s1> ?p ?lit .
>       ?lit pf:textMatch "foo" .
> }
>
> and I get 2 results as I expect:
>
> --------------------------------------------------------------------
> | p             | lit                                              |
> ====================================================================
> | <urn:test:p1> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
> | <urn:test:p2> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> |
> --------------------------------------------------------------------
> However, when I flip the order of my query like this:
>
> PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>
> select * where {
>       ?lit pf:textMatch "foo" .
>       <urn:test:s1> ?p ?lit .
> }
>
> I get 6 results, instead of the two I expect:
>
> --------------------------------------------------------------------
> | lit                                              | p             |
> ====================================================================
> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p1> |
> | "foo"^^<http://www.w3.org/2001/XMLSchema#string> | <urn:test:p2> |
> --------------------------------------------------------------------
>
> My guess as to what happens is that in the second query, the query executor first evaluates the first line (the ?lit pf:textMatch "foo"), which returns 3 results, since there are 3 "foo" literals. The next line of the query then sees three bindings for ?lit, so it produces the 6 results above (2 for each "foo" literal, since there are 2 properties for <urn:test:s1>). I know that I can avoid this by using SELECT DISTINCT, but I still think the query shouldn't produce different results when the order is switched. Additionally, if I put this in a CONSTRUCT query, I can't use DISTINCT to eliminate the duplicate results (unless I use a SELECT DISTINCT subquery, which I'd rather avoid).

I am not sure; at the moment I have no clear idea how this problem 
could be fixed.

Paolo

>
> Another point I've noticed is that in my other (much more complex) queries, against a much larger dataset (~1.5 million triples), if I put the pf:textMatch line anywhere but at the very beginning of the query, the query takes a VERY long time to execute. If I put it as the first line in the query, the query runs quickly. My guess is that the query is executed in order, so the query executor has to do much more work: it first runs the other parts of my query, which produce many results, and then has to go back and essentially filter out those results where the literal doesn't match the pf:textMatch. I can always place the pf:textMatch line first, but then I'm back to the problem mentioned above, where I get too many duplicate results.
>
> Thank you very much for your help!
> -Elli
>