Posted to users@jena.apache.org by Sarven Capadisli <in...@csarven.ca> on 2012/03/01 15:07:42 UTC
Distinct graphs
Hi,
I'm doing the following query:
SELECT DISTINCT ?g
WHERE {
GRAPH ?g {
?s ?p ?o .
}
}
in two ways:
1) Using SPARQLer (web interface for the SPARQL Endpoint)
2) Command-line SOH with s-query
Option 1) doesn't give me a response in a timely manner and
eventually throws a proxy error.
Option 2) does respond and displays graph names incrementally, until it
reaches the largest graph, at which point it throws the following error:
/usr/lib/ruby/1.8/timeout.rb:60:in `rbuf_fill': execution expired (Timeout::Error)
When I try the following:
tdb.tdbquery --desc=tdb2.ttl 'SELECT DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }'
I don't get a response; it just sits there.
Ideas?
-Sarven
Re: Distinct graphs
Posted by Paolo Castagna <ca...@googlemail.com>.
Sarven Capadisli wrote:
> SELECT DISTINCT ?g
> WHERE {
> GRAPH ?g {
> ?s ?p ?o .
> }
> }
Hi Sarven,
asking for a list of the graphs in an RDF dataset is IMHO a pretty reasonable
use case and thing to do.
Your query, however, I suspect scans through all your data. This would explain
the timeouts and long running times.
One suggestion is: if you are sure each graph has some specific triples and/or
properties, you could make the ?s ?p ?o triple pattern more specific and reduce
the amount of bindings/data. This might not be possible.
Another way would be to see if this pattern can be spotted by the optimizer and
do it in a better way. We could use the GSPO index (which is sorted by G, then
S, then P, then O) and apply the "reduced" operation of the SPARQL algebra which
over a sorted stream is equivalent to "distinct". I think it should work.
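To make the idea concrete, here is a minimal Python sketch of why dropping adjacent repeats over a G-first sorted stream gives the same answer as a full "distinct". The list `GSPO` and both function names are made up for illustration; this is not TDB code, and it says nothing about how TDB actually walks its index.

```python
from itertools import groupby

# Hypothetical stand-in for a GSPO index: quads kept sorted by (G, S, P, O).
GSPO = sorted([
    ("http://example.org/g2", "s1", "p", "o1"),
    ("http://example.org/g1", "s1", "p", "o1"),
    ("http://example.org/g1", "s2", "p", "o2"),
    ("http://example.org/g2", "s3", "p", "o3"),
    ("http://example.org/g1", "s4", "p", "o4"),
])


def graph_names_sorted(index):
    """Yield each graph name once by dropping adjacent repeats.

    Correct only because equal graph names are contiguous in a G-first sort.
    """
    for name, _group in groupby(index, key=lambda quad: quad[0]):
        yield name


def graph_names_hashed(index):
    """Plain DISTINCT: remember every name seen so far."""
    seen = set()
    for quad in index:
        if quad[0] not in seen:
            seen.add(quad[0])
            yield quad[0]


streamed = list(graph_names_sorted(GSPO))
hashed = list(graph_names_hashed(GSPO))
```

Note that the sorted variant needs no set of seen names, but in this sketch it still reads every index entry; whether a real index could skip ahead to the next graph is a separate question.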
If others think this is a good idea, I'll go ahead, create a JIRA issue and
try to do it.
Listing the graphs in an RDF dataset seems to me quite an important and common
use case, don't you agree?
My 2 cents,
Paolo
Re: Distinct graphs
Posted by Sarven Capadisli <in...@csarven.ca>.
On 12-03-09 02:49 AM, Paolo Castagna wrote:
> Sarven Capadisli wrote:
>> On 12-03-08 02:47 PM, Paolo Castagna wrote:
>>> Rob Vesse wrote:
>>>> Yes one possibility that me and Andy raised in that discussion was the
>>>> use of the following:
>>>>
>>>> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>>>>
>>>> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
>>>> (which may of course be modified by the presence of FROM and FROM NAMED)
>>>> and the empty graph pattern returns a single empty solution (i.e. always
>>>> matches) then on paper at least this query should do the same job and be
>>>> much more performant. Whether this query works may vary depending on
>>>> how accurately an engine actually implements the SPARQL spec because the
>>>> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
>>>> in the spec and differences of opinion between implementers
>>>
>>> Indeed, the optimization might already be there... Sarven, could you
>>> try to see
>>> if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?
>>
>> First of all, that worked! It took about 10-15 minutes the first time I
>> tried it. I just ran it again.. and 30 minutes in, still waiting for a
>> response. Odd.
>
> Hi Sarven,
> that is too slow for any UI interaction, I suggest you try the other approach,
> you could take the opportunity to use the VoID vocabulary and/or the SPARQL 1.1
> Service Description to add triples which describe your data.
>
> This way you can make the { ?s ?p ?o } more selective and search for:
> { ?s a void:Dataset } or { ?s a sd:Dataset } or { ?s a sd:Graph }.
>
> Have a look here:
>
> - http://www.w3.org/TR/void/
> - http://www.w3.org/TR/sparql11-service-description/
I will have the VoID+SG in the store in any case. However, the simplest
of the queries is the one that we are trying to speed up, i.e., SELECT
DISTINCT ?g WHERE { GRAPH ?g { ?s ?p ?o . } }, which I imagine to be the most
widely used, followed by SELECT DISTINCT ?g { GRAPH ?g { } }. Of course, a
consumer that's aware of the VoID+SG's presence by way of
/.well-known/void.ttl will use it, but it will escape the rest. And
actual queries reveal what's really in the store, and are more reliable in
comparison to some statements making the claim.
Are we ultimately facing the issue where as the store gets larger,
getting the list of graphs becomes more difficult?
There is a way to add the Graph names in TDB assembler. Can this help in
any way with the queries?
> Out of curiosity, how big are your GSPO.dat and GSPO.idn files in the TDB
> directory? To answer your query, TDB needs to scan through all that index,
> whereas with { ?s a void:Dataset } it will need to scan through only a small
> fraction of the POSG index, I suppose.
GSPO.dat 23983030272 bytes
GSPO.idn 293601280 bytes
-Sarven
Re: Literal XML pass-through?
Posted by David Byrden <ge...@byrden.com>.
> I think it's because that is not a legal XML Literal. XML literals
> must be canonical to be valid and valid to be output as
> parseType-literal.
People, thank you for the advice. I was able to get
results thanks to you. And I concluded that Canonical
XML is too fragile for hand editing, so I must drop the
whole thing. :)
David
Re: Literal XML pass-through?
Posted by Andy Seaborne <an...@apache.org>.
On 09/03/12 09:35, David Byrden wrote:
>
> Sorry if this is a simple question, but I have looked for
> examples without success...
>
> I want Jena to read N3 with literal XML values,
> possibly including namespaces. Then write them out
> as RDF, preserving the XML and its namespaces.
>
> An example of my input:
>
> <THING> dc:description
> "Text and a <link to='place'>complex element</link>."^^rdf:XMLLiteral .
I think it's because that is not a legal XML Literal. XML literals
must be canonical to be valid and valid to be output as parseType-literal.
The rules are there to trip you up.
http://www.w3.org/TR/rdf-concepts/#section-XMLLiteral
In this case it's a simple matter of using " not ' in the attribute.
"riot --validate" does checking but it only tells you if it's right or
wrong, not why it's wrong.
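For a rough idea of what "canonical" changes in this particular literal, here is a small Python sketch using the standard library's `xml.etree.ElementTree.canonicalize` (Python 3.8+). Two assumptions to keep in mind: Python implements C14N 2.0, whereas the RDF concepts document refers to exclusive canonical XML, so the two agree on attribute quoting but may differ elsewhere (namespaces in particular); and `canonical_fragment` is a hypothetical helper, not anything Jena or riot provides.

```python
from xml.etree.ElementTree import canonicalize


def canonical_fragment(fragment):
    """Return a canonicalised form of an XML fragment (hypothetical helper).

    The fragment is wrapped in a throwaway <r> element because the parser
    needs a single root; the wrapper is stripped off again afterwards.
    """
    wrapped = "<r>" + fragment + "</r>"
    canon = canonicalize(wrapped)
    return canon[len("<r>"):-len("</r>")]


original = "Text and a <link to='place'>complex element</link>."
canonical = canonical_fragment(original)

# False here: the single-quoted attribute is rewritten with double quotes.
is_canonical = (original == canonical)
```

Comparing a lexical form with its canonicalised version like this at least shows where the two differ, which a plain valid/invalid answer does not.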
> But the RDF output is escaped XML; the angle brackets
> are replaced by escape sequences. I want to process the output
> with XSLT, so I would prefer the XML to remain as such.
This happens when the lexical form is illegal - it outputs the content as
if it were a string in the XML, making any XML characters safe.
> Is this even supposed to be possible?
>
> Oddly enough, if I use very simple XML (no attributes, no
> namespaces) it does approximately what I want!
>
> Thank you.
> David
(You may have guessed I'm not a big fan of rdf:XMLLiterals. Far too
complicated for practical use.)
Andy
Re: Literal XML pass-through?
Posted by Dave Reynolds <da...@gmail.com>.
On 09/03/12 09:35, David Byrden wrote:
>
> Sorry if this is a simple question, but I have looked for
> examples without success...
>
> I want Jena to read N3 with literal XML values,
> possibly including namespaces. Then write them out
> as RDF, preserving the XML and its namespaces.
>
> An example of my input:
>
> <THING> dc:description
> "Text and a <link to='place'>complex element</link>."^^rdf:XMLLiteral .
>
> But the RDF output is escaped XML; the angle brackets
> are replaced by escape sequences. I want to process the output
> with XSLT, so I would prefer the XML to remain as such.
>
> Is this even supposed to be possible?
Yes and works for me:
temp.ttl =
[[[
@prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
@prefix : <http://www.openjena.org/eg#> .
:a :p 'Text and a <link to="place">complex element</link>.'^^rdf:XMLLiteral .
]]]
rdfcat temp.ttl produces
[[[
<rdf:RDF
xmlns="http://www.openjena.org/eg#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<rdf:Description rdf:about="http://www.openjena.org/eg#a">
<p rdf:parseType="Literal">Text and a <link to="place">complex
element</link>.</p>
</rdf:Description>
</rdf:RDF>
]]]
If you are seeing quoting then you probably have an error in the
datatype URI, either in the rdf: declaration or in the spelling of
XMLLiteral itself.
Dave
Literal XML pass-through?
Posted by David Byrden <ge...@byrden.com>.
Sorry if this is a simple question, but I have looked for
examples without success...
I want Jena to read N3 with literal XML values,
possibly including namespaces. Then write them out
as RDF, preserving the XML and its namespaces.
An example of my input:
<THING> dc:description
"Text and a <link to='place'>complex element</link>."^^rdf:XMLLiteral .
But the RDF output is escaped XML; the angle brackets
are replaced by escape sequences. I want to process the output
with XSLT, so I would prefer the XML to remain as such.
Is this even supposed to be possible?
Oddly enough, if I use very simple XML (no attributes, no
namespaces) it does approximately what I want!
Thank you.
David
Re: Distinct graphs
Posted by Paolo Castagna <ca...@googlemail.com>.
Sarven Capadisli wrote:
> On 12-03-08 02:47 PM, Paolo Castagna wrote:
>> Rob Vesse wrote:
>>> Yes one possibility that me and Andy raised in that discussion was the
>>> use of the following:
>>>
>>> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>>>
>>> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
>>> (which may of course be modified by the presence of FROM and FROM NAMED)
>>> and the empty graph pattern returns a single empty solution (i.e. always
>>> matches) then on paper at least this query should do the same job and be
>>> much more performant. Whether this query works may vary depending on
>>> how accurately an engine actually implements the SPARQL spec because the
>>> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
>>> in the spec and differences of opinion between implementers
>>
>> Indeed, the optimization might already be there... Sarven, could you
>> try to see
>> if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?
>
> First of all, that worked! It took about 10-15 minutes the first time I
> tried it. I just ran it again.. and 30 minutes in, still waiting for a
> response. Odd.
Hi Sarven,
that is too slow for any UI interaction. I suggest you try the other approach:
you could take the opportunity to use the VoID vocabulary and/or the SPARQL 1.1
Service Description to add triples which describe your data.
This way you can make the { ?s ?p ?o } more selective and search for:
{ ?s a void:Dataset } or { ?s a sd:Dataset } or { ?s a sd:Graph }.
Have a look here:
- http://www.w3.org/TR/void/
- http://www.w3.org/TR/sparql11-service-description/
Out of curiosity, how big are your GSPO.dat and GSPO.idn files in the TDB
directory? To answer your query, TDB needs to scan through all that index,
whereas with { ?s a void:Dataset } it will need to scan through only a small
fraction of the POSG index, I suppose.
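A toy Python model of the difference, assuming every graph carries exactly one describing triple of type void:Dataset. The data, the `by_predicate_object` dictionary (a crude stand-in for a P-first index) and the counts are all invented for illustration and are not measurements of TDB.

```python
from collections import defaultdict

RDF_TYPE = "rdf:type"
VOID_DATASET = "void:Dataset"

# Hypothetical quads (g, s, p, o); each graph carries one describing triple.
quads = []
for n in range(3):
    g = f"http://example.org/g{n}"
    quads.append((g, g, RDF_TYPE, VOID_DATASET))
    quads.extend((g, f"s{i}", "ex:p", f"o{i}") for i in range(1000))

# Stand-in for a P-first index: entries grouped by (predicate, object).
by_predicate_object = defaultdict(list)
for g, s, p, o in quads:
    by_predicate_object[(p, o)].append((s, g))

# GRAPH ?g { ?s ?p ?o }: every quad is visited.
full_scan_visited = len(quads)
graphs_full = {g for g, _s, _p, _o in quads}

# GRAPH ?g { ?s a void:Dataset }: only the matching index range is visited.
matches = by_predicate_object[(RDF_TYPE, VOID_DATASET)]
selective_visited = len(matches)
graphs_selective = {g for _s, g in matches}
```

The two result sets agree only as long as every graph really does contain such a triple; a graph without one is invisible to the selective query.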
Try and let us know,
Paolo
>
> -Sarven
Re: Distinct graphs
Posted by Andy Seaborne <an...@apache.org>.
On 08/03/12 21:59, Robert Vesse wrote:
> I don't know a lot about the internals of TDB but it may be that the
> two queries are broadly speaking equivalent i.e. in order for TDB to
> determine what graphs are in the dataset it still has to do a full
> scan because AFAIK it is just storing quads and not necessarily
> storing any record of what named graphs are present independent of
> the quads - am I correct in this assumption Andy?
Yes.
> If that is the case the only reason my suggested query is faster is
> because it doesn't have to store the ?s ?p ?o solutions that the
> first method generates
If the SELECT clause does not touch ?s ?p ?o then they are not fetched
from the node table (the binding is delayed evaluation - if you don't
touch a variable, it's not converted from NodeId to node)
e.g. SELECT (count(*) AS ?C) { ?s ?p ?o }
does not touch the node table at all.
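The delayed conversion can be pictured with a short Python sketch. `NodeTable` and `LazyBinding` are hypothetical classes written for this illustration, not Jena's; the `lookups` counter is only there to show when the table is consulted.

```python
class NodeTable:
    """Hypothetical node table: maps integer NodeIds to RDF terms."""

    def __init__(self, terms):
        self._terms = dict(terms)
        self.lookups = 0  # counts how often the table is consulted

    def node(self, node_id):
        self.lookups += 1
        return self._terms[node_id]


class LazyBinding:
    """A solution holding NodeIds; a term is fetched only when asked for."""

    def __init__(self, ids, table):
        self._ids = ids
        self._table = table

    def get(self, var):
        return self._table.node(self._ids[var])


table = NodeTable({
    1: "<http://example.org/a>",
    2: "<http://example.org/p>",
    3: '"x"',
    4: '"y"',
})
# Index entries as (s, p, o) NodeId triples.
index = [(1, 2, 3), (1, 2, 4)]
bindings = [LazyBinding({"s": s, "p": p, "o": o}, table) for s, p, o in index]

count = len(bindings)  # like SELECT (count(*) AS ?C): no term is read
lookups_after_count = table.lookups

subjects = [b.get("s") for b in bindings]  # like SELECT ?s: only ?s is converted
lookups_after_select = table.lookups
```

Counting the bindings leaves `lookups` at zero; only reading a variable costs a node table access.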
Andy
>
> Rob
Re: Distinct graphs
Posted by Robert Vesse <rv...@yarcdata.com>.
I don't know a lot about the internals of TDB but it may be that the two queries are broadly speaking equivalent i.e. in order for TDB to determine what graphs are in the dataset it still has to do a full scan because AFAIK it is just storing quads and not necessarily storing any record of what named graphs are present independent of the quads - am I correct in this assumption Andy?
If that is the case the only reason my suggested query is faster is because it doesn't have to store the ?s ?p ?o solutions that the first method generates
Rob
On Mar 8, 2012, at 12:54 PM, Sarven Capadisli wrote:
> On 12-03-08 02:47 PM, Paolo Castagna wrote:
>> Rob Vesse wrote:
>>> Yes one possibility that me and Andy raised in that discussion was the
>>> use of the following:
>>>
>>> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>>>
>>> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
>>> (which may of course be modified by the presence of FROM and FROM NAMED)
>>> and the empty graph pattern returns a single empty solution (i.e. always
>>> matches) then on paper at least this query should do the same job and be
>>> much more performant. Whether this query works may vary depending on
>>> how accurately an engine actually implements the SPARQL spec because the
>>> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
>>> in the spec and differences of opinion between implementers
>>
>> Indeed, the optimization might already be there... Sarven, could you try to see
>> if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?
>
> First of all, that worked! It took about 10-15 minutes the first time I tried it. I just ran it again.. and 30 minutes in, still waiting for a response. Odd.
>
> -Sarven
Re: Distinct graphs
Posted by Sarven Capadisli <in...@csarven.ca>.
On 12-03-08 02:47 PM, Paolo Castagna wrote:
> Rob Vesse wrote:
>> Yes one possibility that me and Andy raised in that discussion was the
>> use of the following:
>>
>> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>>
>> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
>> (which may of course be modified by the presence of FROM and FROM NAMED)
>> and the empty graph pattern returns a single empty solution (i.e. always
>> matches) then on paper at least this query should do the same job and be
>> much more performant. Whether this query works may vary depending on
>> how accurately an engine actually implements the SPARQL spec because the
>> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
>> in the spec and differences of opinion between implementers
>
> Indeed, the optimization might already be there... Sarven, could you try to see
> if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?
First of all, that worked! It took about 10-15 minutes the first time I
tried it. I just ran it again.. and 30 minutes in, still waiting for a
response. Odd.
-Sarven
Re: Distinct graphs
Posted by Paolo Castagna <ca...@googlemail.com>.
Rob Vesse wrote:
> Yes one possibility that me and Andy raised in that discussion was the
> use of the following:
>
> SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
>
> Since GRAPH ?g is defined as an iteration over all graphs in the dataset
> (which may of course be modified by the presence of FROM and FROM NAMED)
> and the empty graph pattern returns a single empty solution (i.e. always
> matches) then on paper at least this query should do the same job and be
> much more performant. Whether this query works may vary depending on
> how accurately an engine actually implements the SPARQL spec because the
> whole dataset/GRAPH interaction is one of the areas prone to ambiguities
> in the spec and differences of opinion between implementers
Indeed, the optimization might already be there... Sarven, could you try to see
if SELECT DISTINCT ?g { GRAPH ?g { } } gives you what you want, faster?
qparse --print quad --explain "SELECT DISTINCT ?g { GRAPH ?g { ?s ?p ?o } }"
SELECT DISTINCT ?g
WHERE
{ GRAPH ?g
{ ?s ?p ?o }
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(distinct
(project (?g)
(quadpattern (quad ?g ?s ?p ?o))))
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(distinct
(project (?g)
(graph ?g
(bgp (triple ?s ?p ?o)))))
qparse --print quad --explain "SELECT DISTINCT ?g { GRAPH ?g { } }"
SELECT DISTINCT ?g
WHERE
{ GRAPH ?g
{ }
}
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(distinct
(project (?g)
(datasetnames ?g)))
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
(distinct
(project (?g)
(graph ?g
(table unit))))
Paolo
Re: Distinct graphs
Posted by Rob Vesse <ra...@ecs.soton.ac.uk>.
On 3/8/12 11:30 AM, Paolo Castagna wrote:
> Sarven Capadisli wrote:
>> I'm doing the following query:
>>
>> SELECT DISTINCT ?g
>> WHERE {
>> GRAPH ?g {
>> ?s ?p ?o .
>> }
>> }
> See also discussion here:
> http://answers.semanticweb.com/questions/393/clarification-of-meaning-of-graph-clause-in-sparql-with-no-from-clause
>
> Paolo
Yes one possibility that me and Andy raised in that discussion was the
use of the following:
SELECT DISTINCT ?g WHERE { GRAPH ?g { } }
Since GRAPH ?g is defined as an iteration over all graphs in the dataset
(which may of course be modified by the presence of FROM and FROM NAMED)
and the empty graph pattern returns a single empty solution (i.e. always
matches) then on paper at least this query should do the same job and be
much more performant. Whether this query works may vary depending on
how accurately an engine actually implements the SPARQL spec because the
whole dataset/GRAPH interaction is one of the areas prone to ambiguities
in the spec and differences of opinion between implementers.
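The "on paper" reading above can be sketched in a few lines of Python. The dictionary `named_graphs` and both evaluator functions are hypothetical and model only the algebra, not any engine; the empty graph is included on purpose to show where the two queries could differ.

```python
# Hypothetical dataset: named graph -> set of (s, p, o) triples.
named_graphs = {
    "http://example.org/g1": {("s1", "p", "o1"), ("s2", "p", "o2")},
    "http://example.org/g2": {("s3", "p", "o3")},
    "http://example.org/empty": set(),
}


def eval_graph_empty_pattern(dataset):
    """GRAPH ?g { }: one solution per named graph, since { } always matches once."""
    return [{"g": name} for name in dataset]


def eval_graph_spo(dataset):
    """GRAPH ?g { ?s ?p ?o }: one solution per triple in each named graph."""
    return [
        {"g": name, "s": s, "p": p, "o": o}
        for name, triples in dataset.items()
        for (s, p, o) in sorted(triples)
    ]


names_empty_pattern = {sol["g"] for sol in eval_graph_empty_pattern(named_graphs)}
names_spo = {sol["g"] for sol in eval_graph_spo(named_graphs)}
```

In this model the empty pattern produces one row per graph rather than one per triple, and it also reports a graph with no triples, which the ?s ?p ?o form cannot. Whether a given store can hold an empty named graph at all is implementation-dependent.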
Rob
Re: Distinct graphs
Posted by Paolo Castagna <ca...@googlemail.com>.
Sarven Capadisli wrote:
> I'm doing the following query:
>
> SELECT DISTINCT ?g
> WHERE {
> GRAPH ?g {
> ?s ?p ?o .
> }
> }
See also discussion here:
http://answers.semanticweb.com/questions/393/clarification-of-meaning-of-graph-clause-in-sparql-with-no-from-clause
Paolo