Posted to users@jena.apache.org by Sarven Capadisli <in...@csarven.ca> on 2012/03/01 15:07:42 UTC

Distinct graphs

Hi,

I'm doing the following query:

SELECT DISTINCT ?g
WHERE {
   GRAPH ?g {
     ?s ?p ?o .
   }
}

in two ways:

1) Using SPARQLer (web interface for the SPARQL Endpoint)

2) Command-line SOH with s-query

Option 1) doesn't give me a response back in a timely manner and 
eventually throws a proxy error for me.

Option 2) does respond and displays graph names incrementally, until it 
reaches the largest graph, at which point it throws the following error:

/usr/lib/ruby/1.8/timeout.rb:60:in `rbuf_fill': execution expired (Timeout::Error)

When I try the following:

tdb.tdbquery --desc=tdb2.ttl 'SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }'

I don't get a response; it just sits there.

Ideas?

-Sarven

Re: Distinct graphs

Posted by Paolo Castagna <ca...@googlemail.com>.

Sarven Capadisli wrote:
> SELECT DISTINCT ?g
> WHERE {
>   GRAPH ?g {
>     ?s ?p ?o .
>   }
> }

Hi Sarven,
asking for a list of the graphs in an RDF dataset is IMHO a pretty reasonable
use case and thing to do.

Your query, however, I suspect scans through all your data. This would explain
the timeouts and long running times.

One suggestion is: if you are sure each graph has some specific triples and/or
properties, you could make the ?s ?p ?o triple pattern more specific and reduce
the amount of bindings/data. This might not be possible.

Another way would be to have the optimizer spot this pattern and evaluate it
in a better way. We could use the GSPO index (which is sorted by G, then S,
then P, then O) and apply the "reduced" operation of the SPARQL algebra, which
over a sorted stream is equivalent to "distinct". I think it should work.
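
The sorted-index idea above can be sketched roughly as follows. This is only
an illustration in Python, not TDB code: the tuple layout, the sample data and
the function name graph_names_from_gspo are all assumptions made up for the
example. It shows why dropping adjacent repeats over a stream sorted on G
gives the same answer as DISTINCT without remembering every value seen so far.

```python
from itertools import groupby


def graph_names_from_gspo(quads):
    """Yield each graph name once from quads already sorted by (G, S, P, O)."""
    # Equal G values are adjacent in a G-sorted stream, so collapsing
    # adjacent repeats gives the DISTINCT result in constant memory.
    for g, _group in groupby(quads, key=lambda quad: quad[0]):
        yield g


# Hypothetical index contents: (graph, subject, predicate, object) tuples.
gspo_index = sorted([
    ("g2", "s1", "p", "o1"),
    ("g1", "s1", "p", "o1"),
    ("g1", "s2", "p", "o2"),
    ("g3", "s9", "p", "o4"),
    ("g2", "s3", "p", "o7"),
])

names = list(graph_names_from_gspo(gspo_index))
print(names)
```

Note that this sketch still walks every entry of the index; it only avoids
the memory cost of a general DISTINCT. Whether a real index could skip ahead
to the next G value is a separate question not covered here.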

If others think this is a good idea, I'll go ahead, create a JIRA issue and
try to do it.

Listing the graphs in an RDF dataset seems to me quite an important and common
use case, don't you agree?

My 2 cents,
Paolo

Re: Distinct graphs

Posted by Sarven Capadisli <in...@csarven.ca>.

On 12-03-09 02:49 AM, Paolo Castagna wrote:
> Sarven Capadisli wrote:
>> On 12-03-08 02:47 PM, Paolo Castagna wrote:
>>> Rob Vesse wrote:
>>>> Yes one possibility that me and Andy raised in that discussion was the
>>>> use of the following:
>>>>
>>>> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>>>>
>>>> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
>>>> (which may of course be modified by the presence of FROM and FROM NAMED)
>>>> and the empty graph pattern returns a single empty solution (i.e. always
>>>> matches) then on paper at least this query should do the same job and be
>>>> much more performant.  Whether this query works may vary depending on
>>>> how accurately an engine actually implements the SPARQL spec because the
>>>> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
>>>> in the spec and differences of opinion between implementers
>>>
>>> Indeed, the optimization might already be there... Sarven, could you
>>> try to see
>>> if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?
>>
>> First of all, that worked! It took about 10-15 minutes the first time I
>> tried it. I just ran it again.. and 30 minutes in, still waiting for a
>> response. Odd.
>
> Hi Sarven,
> that is too slow for any UI interaction. I suggest you try the other approach:
> you could take the opportunity to use the VoID vocabulary and/or the SPARQL 1.1
> Service Description to add triples which describe your data.
>
> This way you can make the { ?s ?p ?o } more selective and search for:
> { ?s a void:Dataset } or { ?s a sd:Dataset } or { ?s a sd:Graph }.
>
> Have a look here:
>
>   - http://www.w3.org/TR/void/
>   - http://www.w3.org/TR/sparql11-service-description/

I will have the VoID+SG in the store in any case. However, the simplest 
of the queries is the one that we are trying to speed up, i.e., SELECT 
DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }, which I imagine to be the 
most widely used, followed by SELECT DISTINCT ?g { GRAPH ?g { } }. Of 
course, a consumer that's aware of the VoID+SG's presence by way of 
/.well-known/void.ttl will use it, but the rest will miss it. And 
actual queries reveal what's really in the store, so they are more 
reliable than statements that merely make the claim.

Are we ultimately facing the issue where, as the store gets larger, 
getting the list of graphs becomes more difficult?

There is a way to add the Graph names in TDB assembler. Can this help in 
any way with the queries?

> Out of curiosity, how big are your GSPO.dat and GSPO.idn files in the TDB
> directory? To answer your query, TDB needs to scan through all of that index,
> while with { ?s a void:Dataset } it will need to scan through only a small
> fraction of the POSG index, I suppose.

GSPO.dat 23983030272 bytes
GSPO.idn 293601280 bytes

-Sarven

Re: Literal XML pass-through?

Posted by David Byrden <ge...@byrden.com>.

> >> I think it's because that is not a legal XML Literal.  XML
> >> literals must be canonical to be valid and valid to be output as
> >> parseType-literal.


People, thank you for the advice. I was able to get
results thanks to you. And I concluded that Canonical
XML  is too fragile for hand editing, so I must drop the
whole thing.  :)

David

Re: Literal XML pass-through?

Posted by Andy Seaborne <an...@apache.org>.

On 09/03/12 09:35, David Byrden wrote:
>
> Sorry if this is a simple question, but I have looked for
> examples without success...
>
> I want Jena to read N3 with literal XML values,
> possibly including namespaces. Then write them out
> as RDF, preserving the XML and its namespaces.
>
> An example of my input:
>
> <THING> dc:description
> "Text and a <link to='place'>complex element</link>."^^rdf:XMLLiteral .

I think it's because that is not a legal XML Literal.  XML literals 
must be canonical to be valid and valid to be output as parseType-literal.

The rules are there to trip you up.

http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral

In this case it's a simple matter of using " not ' in the attribute.

"riot --validate" does checking but it only tells you if it's right or 
wrong, not why it's wrong.
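
As a rough way to see why the single-quoted attribute trips the check, here
is a small Python sketch. It is not what Jena or riot does internally. It
uses the standard library's canonicalize, which implements C14N 2.0 rather
than the exclusive canonical XML that the RDF spec refers to, so treat it as
an approximation. The <x> wrapper element is an assumption added only so the
mixed content parses as one document.

```python
from xml.etree.ElementTree import canonicalize

# The literal from the question, wrapped in a hypothetical <x> root element
# because the parser needs a single document element.
wrapped = "<x>Text and a <link to='place'>complex element</link>.</x>"

# Canonicalization rewrites attribute values with double quotes.
canonical = canonicalize(wrapped)

# If the text changes under canonicalization, it was not in canonical form.
is_canonical = (canonical == wrapped)
print(canonical)
print(is_canonical)
```

Unlike "riot --validate", comparing the two strings at least points at what
would have to change in the input.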

> But the RDF output is escaped XML; the angle brackets
> are replaced by escape sequences. I want to process the output
> with XSLT, so I would prefer the XML to remain as such.

This happens when the lexical form is illegal - it outputs the content as 
if it were a string in the XML, making any XML characters safe.

> Is this even supposed to be possible?
>
> Oddly enough, if I use very simple XML (no attributes, no
> namespaces) it does approximately what I want!
>
> Thank you.
> David

(You may have guessed I'm not a big fan of rdf:XMLLiterals.  Far too 
complicated for practical use.)

	Andy

Re: Literal XML pass-through?

Posted by Dave Reynolds <da...@gmail.com>.

On 09/03/12 09:35, David Byrden wrote:
>
> Sorry if this is a simple question, but I have looked for
> examples without success...
>
> I want Jena to read N3 with literal XML values,
> possibly including namespaces. Then write them out
> as RDF, preserving the XML and its namespaces.
>
> An example of my input:
>
> <THING> dc:description
> "Text and a <link to='place'>complex element</link>."^^rdf:XMLLiteral .
>
> But the RDF output is escaped XML; the angle brackets
> are replaced by escape sequences. I want to process the output
> with XSLT, so I would prefer the XML to remain as such.
>
> Is this even supposed to be possible?

Yes and works for me:

temp.ttl =

[[[
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://www.openjena.org/eg#> .
:a :p 'Text and a <link to="place">complex element</link>.'^^rdf:XMLLiteral  .
]]]


rdfcat temp.ttl produces

[[[
<rdf:RDF
     xmlns="http://www.openjena.org/eg#"
     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
   <rdf:Description rdf:about="http://www.openjena.org/eg#a">
     <p rdf:parseType="Literal">Text and a <link to="place">complex element</link>.</p>
   </rdf:Description>
</rdf:RDF>
]]]


If you are seeing quoting then you probably have an error in the 
datatype URI, either in the rdf: declaration or in the spelling of 
XMLLiteral itself.

Dave

Literal XML pass-through?

Posted by David Byrden <ge...@byrden.com>.

Sorry if this is a simple question, but I have looked for
examples without success...

I want Jena to read N3 with literal XML values,
possibly including namespaces. Then write them out
as RDF, preserving the XML and its namespaces.

An example of my input:

<THING>     dc:description
  "Text and a <link to='place'>complex element</link>."^^rdf:XMLLiteral    .

But the RDF output is escaped XML; the angle brackets
are replaced by escape sequences. I want to process the output
with XSLT, so I would prefer the XML to remain as such.

Is this even supposed to be possible?

Oddly enough, if I use very simple XML (no attributes, no
namespaces) it does approximately what I want!

Thank you.
David

Re: Distinct graphs

Posted by Paolo Castagna <ca...@googlemail.com>.

Sarven Capadisli wrote:
> On 12-03-08 02:47 PM, Paolo Castagna wrote:
>> Rob Vesse wrote:
>>> Yes one possibility that me and Andy raised in that discussion was the
>>> use of the following:
>>>
>>> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>>>
>>> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
>>> (which may of course be modified by the presence of FROM and FROM NAMED)
>>> and the empty graph pattern returns a single empty solution (i.e. always
>>> matches) then on paper at least this query should do the same job and be
>>> much more performant.  Whether this query works may vary depending on
>>> how accurately an engine actually implements the SPARQL spec because the
>>> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
>>> in the spec and differences of opinion between implementers
>>
>> Indeed, the optimization might already be there... Sarven, could you
>> try to see
>> if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?
> 
> First of all, that worked! It took about 10-15 minutes the first time I
> tried it. I just ran it again.. and 30 minutes in, still waiting for a
> response. Odd.

Hi Sarven,
that is too slow for any UI interaction. I suggest you try the other approach:
you could take the opportunity to use the VoID vocabulary and/or the SPARQL 1.1
Service Description to add triples which describe your data.

This way you can make the { ?s ?p ?o } more selective and search for:
{ ?s a void:Dataset } or { ?s a sd:Dataset } or { ?s a sd:Graph }.

Have a look here:

 - http://www.w3.org/TR/void/
 - http://www.w3.org/TR/sparql11-service-description/

Out of curiosity, how big are your GSPO.dat and GSPO.idn files in the TDB
directory? To answer your query, TDB needs to scan through all of that index,
while with { ?s a void:Dataset } it will need to scan through only a small
fraction of the POSG index, I suppose.

Try and let us know,
Paolo

> 
> -Sarven

Re: Distinct graphs

Posted by Andy Seaborne <an...@apache.org>.

On 08/03/12 21:59, Robert Vesse wrote:
> I don't know a lot about the internals of TDB but it may be that the
> two queries are broadly speaking equivalent i.e. in order for TDB to
> determine what graphs are in the dataset it still has to do a full
> scan because AFAIK it is just storing quads and not necessarily
> storing any record of what named graphs are present independent of
> the quads - am I correct in this assumption Andy?

Yes.

> If that is the case the only reason my suggested query is faster is
> because it doesn't have to store the ?s ?p ?o solutions that the
> first method generates

If the SELECT clause does not touch ?s ?p ?o then they are not fetched 
from the node table (the binding is delayed evaluation - if you don't 
touch a variable, it's not converted from NodeId to node)

e.g. SELECT (count(*) AS ?C) { ?s ?p ?o }

does not touch the node table at all.
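
The delayed evaluation described above can be illustrated with a toy Python
sketch. The NodeTable and LazyBinding classes are made up for this example
and are not TDB's classes. The point is only that a binding can carry ids
and look up the actual term the first time a variable is read.

```python
class NodeTable:
    """Toy stand-in for a node table mapping numeric ids to RDF terms."""

    def __init__(self, terms):
        self._terms = terms
        self.lookups = 0  # counts how often the table is consulted

    def fetch(self, node_id):
        self.lookups += 1
        return self._terms[node_id]


class LazyBinding:
    """Holds one id per variable; converts to a term only when read."""

    def __init__(self, ids, table):
        self._ids = ids
        self._table = table
        self._cache = {}

    def get(self, var):
        if var not in self._cache:
            self._cache[var] = self._table.fetch(self._ids[var])
        return self._cache[var]


table = NodeTable({1: "<http://example.org/s>", 2: "<http://example.org/p>", 3: '"o"'})
# Index rows hold ids only, in (s, p, o) order.
rows = [(1, 2, 3), (1, 2, 1)]
bindings = [LazyBinding({"s": s, "p": p, "o": o}, table) for s, p, o in rows]

count = len(bindings)                  # like count(*): no variable is read
lookups_after_count = table.lookups
first_subject = bindings[0].get("s")   # reading ?s triggers one lookup
print(count, lookups_after_count, first_subject, table.lookups)
```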

	Andy

>
> Rob

Re: Distinct graphs

Posted by Robert Vesse <rv...@yarcdata.com>.

I don't know a lot about the internals of TDB but it may be that the two queries are broadly speaking equivalent i.e. in order for TDB to determine what graphs are in the dataset it still has to do a full scan because AFAIK it is just storing quads and not necessarily storing any record of what named graphs are present independent of the quads - am I correct in this assumption Andy?

If that is the case the only reason my suggested query is faster is because it doesn't have to store the ?s ?p ?o solutions that the first method generates

Rob

On Mar 8, 2012, at 12:54 PM, Sarven Capadisli wrote:

> On 12-03-08 02:47 PM, Paolo Castagna wrote:
>> Rob Vesse wrote:
>>> Yes one possibility that me and Andy raised in that discussion was the
>>> use of the following:
>>> 
>>> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>>> 
>>> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
>>> (which may of course be modified by the presence of FROM and FROM NAMED)
>>> and the empty graph pattern returns a single empty solution (i.e. always
>>> matches) then on paper at least this query should do the same job and be
>>> much more performant.  Whether this query works may vary depending on
>>> how accurately an engine actually implements the SPARQL spec because the
>>> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
>>> in the spec and differences of opinion between implementers
>> 
>> Indeed, the optimization might already be there... Sarven, could you try to see
>> if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?
> 
> First of all, that worked! It took about 10-15 minutes the first time I tried it. I just ran it again.. and 30 minutes in, still waiting for a response. Odd.
> 
> -Sarven

Re: Distinct graphs

Posted by Sarven Capadisli <in...@csarven.ca>.

On 12-03-08 02:47 PM, Paolo Castagna wrote:
> Rob Vesse wrote:
>> Yes one possibility that me and Andy raised in that discussion was the
>> use of the following:
>>
>> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>>
>> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
>> (which may of course be modified by the presence of FROM and FROM NAMED)
>> and the empty graph pattern returns a single empty solution (i.e. always
>> matches) then on paper at least this query should do the same job and be
>> much more performant.  Whether this query works may vary depending on
>> how accurately an engine actually implements the SPARQL spec because the
>> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
>> in the spec and differences of opinion between implementers
>
> Indeed, the optimization might already be there... Sarven, could you try to see
> if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?

First of all, that worked! It took about 10-15 minutes the first time I 
tried it. I just ran it again.. and 30 minutes in, still waiting for a 
response. Odd.

-Sarven

Re: Distinct graphs

Posted by Paolo Castagna <ca...@googlemail.com>.

Rob Vesse wrote:
> Yes one possibility that me and Andy raised in that discussion was the
> use of the following:
> 
> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
> 
> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
> (which may of course be modified by the presence of FROM and FROM NAMED)
> and the empty graph pattern returns a single empty solution (i.e. always
> matches) then on paper at least this query should do the same job and be
> much more performant.  Whether this query works may vary depending on
> how accurately an engine actually implements the SPARQL spec because the
> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
> in the spec and differences of opinion between implementers

Indeed, the optimization might already be there... Sarven, could you try to see
if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?


qparse --print quad --explain "SELECT DISTINCT ?g { GRAPH ?g { ?s ?p ?o } }"
SELECT DISTINCT  ?g
WHERE
  { GRAPH ?g
      { ?s ?p ?o }
  }
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(distinct
  (project (?g)
    (quadpattern (quad ?g ?s ?p ?o))))
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(distinct
  (project (?g)
    (graph ?g
      (bgp (triple ?s ?p ?o)))))



qparse --print quad --explain "SELECT DISTINCT ?g { GRAPH ?g { } }"
SELECT DISTINCT  ?g
WHERE
  { GRAPH ?g
      {  }
  }
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(distinct
  (project (?g)
    (datasetnames ?g)))
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(distinct
  (project (?g)
    (graph ?g
      (table unit))))


Paolo

Re: Distinct graphs

Posted by Rob Vesse <ra...@ecs.soton.ac.uk>.

On 3/8/12 11:30 AM, Paolo Castagna wrote:
> Sarven Capadisli wrote:
>> I'm doing the following query:
>>
>> SELECT DISTINCT ?g
>> WHERE {
>>    GRAPH ?g {
>>      ?s ?p ?o .
>>    }
>> }
> See also discussion here:
> http://answers.semanticweb.com/questions/393/clarification-of-meaning-of-graph-clause-in-sparql-with-no-from-clause
>
> Paolo

Yes, one possibility that Andy and I raised in that discussion was the 
use of the following:

SELECT DISTINCT ?g WHERE { GRAPH ?g { } }

GRAPH ?g is defined as an iteration over all graphs in the dataset 
(which may of course be modified by the presence of FROM and FROM NAMED), 
and the empty graph pattern returns a single empty solution (i.e. it always 
matches), so on paper at least this query should do the same job and be 
much more performant.  Whether this query works may vary depending on 
how accurately an engine actually implements the SPARQL spec, because the 
whole dataset/GRAPH interaction is one of the areas prone to ambiguities 
in the spec and differences of opinion between implementers.
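
To make the reasoning above concrete, here is a minimal Python sketch of how
GRAPH ?g { ... } could be evaluated on paper. It is a toy model, not any
engine's implementation; the dict-based dataset and all function names are
assumptions for illustration.

```python
def eval_graph_var(dataset, eval_pattern):
    """Evaluate GRAPH ?g { pattern }: run the pattern per named graph, bind ?g."""
    for name, triples in dataset.items():
        for solution in eval_pattern(triples):
            yield {**solution, "g": name}


def empty_pattern(_triples):
    # The empty group pattern matches any graph with one empty solution.
    yield {}


def spo_pattern(triples):
    # { ?s ?p ?o } produces one solution per triple in the graph.
    for s, p, o in triples:
        yield {"s": s, "p": p, "o": o}


# Hypothetical named graphs keyed by graph name.
named_graphs = {
    "g1": [("a", "p", "b"), ("a", "p", "c")],
    "g2": [("x", "q", "y")],
}

via_empty = [sol["g"] for sol in eval_graph_var(named_graphs, empty_pattern)]
spo_solutions = list(eval_graph_var(named_graphs, spo_pattern))
via_spo = sorted({sol["g"] for sol in spo_solutions})
print(via_empty, via_spo, len(spo_solutions))
```

In this model the empty pattern yields one solution per graph, while
{ ?s ?p ?o } yields one per triple that DISTINCT then has to collapse. The
two forms would differ for a named graph containing no triples, which only
the empty pattern would report.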

Rob

Re: Distinct graphs

Posted by Paolo Castagna <ca...@googlemail.com>.

Sarven Capadisli wrote:
> I'm doing the following query:
> 
> SELECT DISTINCT ?g
> WHERE {
>   GRAPH ?g {
>     ?s ?p ?o .
>   }
> }

See also discussion here:
http://answers.semanticweb.com/questions/393/clarification-of-meaning-of-graph-clause-in-sparql-with-no-from-clause

Paolo