Posted to users@jena.apache.org by Julien Plu <ju...@redaction-developpez.com> on 2017/10/02 09:30:59 UTC
Querying TDB takes ages
Hello,
The code I'm using can be found here:
https://gist.github.com/jplu/9d3aa4075145e31c2882f3372b1be3e3
My problem is that one iteration of my loop (line 88) takes a very long
time (between 3 and 5 minutes), and I don't understand why.
I think I must be missing something in how I use TDB,
but I don't see what.
The dataset is DBpedia.
Thanks in advance for any pointers.
Regards.
Julien Plu
PhD Student, EURECOM
plu.julien@gmail.com | julien.plu@eurecom.fr
http://jplu.github.io
Campus SophiaTech
450 route des Chappes
06410 Biot, France
Phone: +33 (0) 4 93008103
Re: Querying TDB takes ages
Posted by Rob Vesse <rv...@dotnetrdf.org>.
Yes exactly
Rob
On Monday, 2 October 2017 11:06, Julien Plu wrote:
> So what you propose is to have a query like this?
> [...]
Re: Querying TDB takes ages
Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.
On 02.10.2017 12:06, Julien Plu wrote:
> UNION {
> VALUES ?p {rdf:type} .
> ?link ?p ?o .
> FILTER(CONTAINS(STR(?o), "http://dbpedia.org/ontology/")) .
> }
I would rewrite this. Provided that you have loaded the DBpedia ontology, you
can use:
UNION {
?o a owl:Class .
?link rdf:type ?o .
}
Why this should work: I guess you simply want to exclude classes from
YAGO and schema.org. Those are filtered out by the triple pattern
?o a owl:Class
because only the DBpedia classes have matching triples in the
DBpedia ontology.
Cheers,
Lorenz
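A minimal sketch of how Lorenz's pattern might be combined with the restructured query, written in Python only so that the %offset substitution can be shown and checked. None of this is code from the thread: the name `build_query` is hypothetical, the `owl:` prefix declaration is added because the posted query does not declare it, and `VALUES ?p { rdf:type }` is kept on the assumption that the outer GROUP BY still needs ?p bound in that branch. The function only builds the query string; executing it against TDB is left to the caller.

```python
from string import Template

# Query text adapted from the thread; $offset stands in for the %offset placeholder.
# Assumption: the DBpedia ontology is loaded, so "?o a owl:Class" matches DBpedia classes.
QUERY_TEMPLATE = Template("""\
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl: <http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?p (GROUP_CONCAT(DISTINCT ?o;separator="-----") AS ?vals) ?id ?pr ?link WHERE {
  {
    SELECT DISTINCT ?link (STR(?o3) AS ?id) (STR(?o2) AS ?pr) WHERE {
      ?link dbo:wikiPageRank ?o2 .
      ?link dbo:wikiPageID ?o3 .
      FILTER NOT EXISTS { ?link dbo:wikiPageRedirects ?x } .
      FILTER NOT EXISTS { ?link dbo:wikiPageDisambiguates ?y } .
    } LIMIT 1 OFFSET $offset
  }
  {
    ?link ?p ?o .
    FILTER(DATATYPE(?o) = xsd:string || LANG(?o) = "en") .
  } UNION {
    VALUES ?p { dbo:wikiPageRedirects dbo:wikiPageDisambiguates } .
    ?x ?p ?link .
    ?x rdfs:label ?o .
  } UNION {
    VALUES ?p { rdf:type } .
    ?o a owl:Class .
    ?link ?p ?o .
  }
} GROUP BY ?p ?id ?pr ?link
""")


def build_query(offset: int) -> str:
    """Return the query text for one loop iteration (hypothetical helper)."""
    # Reject anything that is not a plain non-negative integer before it reaches the query.
    if isinstance(offset, bool) or not isinstance(offset, int) or offset < 0:
        raise ValueError("offset must be a non-negative integer")
    return QUERY_TEMPLATE.substitute(offset=offset)
```

Calling `build_query(0)`, `build_query(1)`, and so on would give one query per entity, in the same way the loop in the gist substitutes %offset. Note that without an ORDER BY in the subquery, SPARQL does not guarantee that successive offsets walk the entities in a stable order.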
Re: Querying TDB takes ages
Posted by Julien Plu <pl...@gmail.com>.
Thanks Rob for your quick reply!
Hmm, I see. What you are saying makes sense. So what you propose is to have a query like this?
PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?p (GROUP_CONCAT(DISTINCT ?o;separator="-----") AS ?vals) ?id ?pr ?link WHERE {
    {
        SELECT DISTINCT ?link (STR(?o3) AS ?id) (STR(?o2) AS ?pr) WHERE {
            ?link dbo:wikiPageRank ?o2 .
            ?link dbo:wikiPageID ?o3 .
            FILTER NOT EXISTS{?link dbo:wikiPageRedirects ?x} .
            FILTER NOT EXISTS{?link dbo:wikiPageDisambiguates ?y} .
        } LIMIT 1 OFFSET %offset
    }
    {
        ?link ?p ?o .
        FILTER(DATATYPE(?o) = xsd:string || LANG(?o) = "en") .
    } UNION {
        VALUES ?p {dbo:wikiPageRedirects dbo:wikiPageDisambiguates} .
        ?x ?p ?link .
        ?x rdfs:label ?o .
    } UNION {
        VALUES ?p {rdf:type} .
        ?link ?p ?o .
        FILTER(CONTAINS(STR(?o), "http://dbpedia.org/ontology/")) .
    }
} GROUP BY ?p ?id ?pr ?link
Julien Plu
Re: Querying TDB takes ages
Posted by Rob Vesse <rv...@dotnetrdf.org>.
Julien
At a glance, your query is very broad: it effectively selects the entire dataset and applies string filters, e.g. the CONTAINS filter, over all of it.
This will force TDB to read pretty much the entire dataset on every single query. You may be better off moving the subquery with the LIMIT to the start of your query, as TDB can then probably use its single result to limit the amount of data it has to read to answer the rest of the query.
Rob
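The effect Rob describes can be illustrated with a toy in-memory model. This is only a sketch of the reasoning, not how TDB is implemented: the triples, the subject index `by_subject`, and both function names are made up for illustration, and TDB's real indexes and optimizer behave differently. It compares how many triples must be examined when a string filter runs over everything versus when the subject is already bound by an earlier step.

```python
from collections import defaultdict

# Made-up sample data standing in for a (much larger) dataset.
TRIPLES = [
    ("dbr:Paris", "rdf:type", "http://dbpedia.org/ontology/City"),
    ("dbr:Paris", "rdf:type", "http://schema.org/City"),
    ("dbr:Paris", "rdfs:label", "Paris"),
    ("dbr:Berlin", "rdf:type", "http://dbpedia.org/ontology/City"),
    ("dbr:Berlin", "rdfs:label", "Berlin"),
    ("dbr:Rome", "rdf:type", "http://dbpedia.org/ontology/City"),
]

# Hypothetical subject index: subject -> list of its triples.
by_subject = defaultdict(list)
for triple in TRIPLES:
    by_subject[triple[0]].append(triple)


def scan_then_filter(triples, needle):
    """Unbound subject: every triple has to be read before the filter can apply."""
    examined = 0
    hits = []
    for s, p, o in triples:
        examined += 1
        if needle in o:
            hits.append((s, p, o))
    return hits, examined


def lookup_then_filter(index, subject, needle):
    """Subject already bound (e.g. by a LIMIT 1 subquery): read only its triples."""
    candidates = index.get(subject, [])
    hits = [t for t in candidates if needle in t[2]]
    return hits, len(candidates)


NEEDLE = "http://dbpedia.org/ontology/"
all_hits, all_examined = scan_then_filter(TRIPLES, NEEDLE)
one_hits, one_examined = lookup_then_filter(by_subject, "dbr:Paris", NEEDLE)
print(f"full scan examined {all_examined} triples, bound subject examined {one_examined}")
```

On a dataset the size of DBpedia the gap between the two counts would be far larger, which is the intuition behind evaluating the LIMIT 1 subquery first.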