Posted to users@jena.apache.org by Julien Plu <ju...@redaction-developpez.com> on 2017/10/02 09:30:59 UTC

Querying TDB takes ages

Hello,

The code I'm using can be found here:
https://gist.github.com/jplu/9d3aa4075145e31c2882f3372b1be3e3

My problem is that one iteration of my loop (line 88) takes a very long
time (between 3 and 5 minutes), and I don't understand why.

I suspect I am missing something in how I use TDB, but I can't see what.

The dataset is DBpedia.

Thanks in advance for any pointers.

Regards.

*Julien Plu*
PhD Student, EURECOM
plu.julien@gmail.com | julien.plu@eurecom.fr
http://jplu.github.io
Campus SophiaTech
450 route des Chappes
06410 Biot, France
Phone: +33 (0) 4 93008103

Re: Querying TDB takes ages

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Yes, exactly.

Rob

From: Julien Plu <pl...@gmail.com>
Reply-To: <us...@jena.apache.org>
Date: Monday, 2 October 2017 11:06
To: <us...@jena.apache.org>
Subject: Re: Querying TDB takes ages

Thanks Rob for your quick reply!

hummm I see, what you are saying indeed makes sense, so what you propose is to have a query like this?

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?p (GROUP_CONCAT(DISTINCT ?o;separator="-----") AS ?vals) ?id ?pr ?link WHERE {
    {
        SELECT DISTINCT ?link (STR(?o3) AS ?id) (STR(?o2) AS ?pr) WHERE {
            ?link dbo:wikiPageRank ?o2 .
            ?link dbo:wikiPageID ?o3 .
            FILTER NOT EXISTS{?link dbo:wikiPageRedirects ?x} .
            FILTER NOT EXISTS{?link dbo:wikiPageDisambiguates ?y} .
        } LIMIT 1 OFFSET %offset
    }
    {
        ?link ?p ?o .
        FILTER(DATATYPE(?o) = xsd:string || LANG(?o) = "en") .
    } UNION {
        VALUES ?p {dbo:wikiPageRedirects dbo:wikiPageDisambiguates} .
        ?x ?p ?link .
        ?x rdfs:label ?o .
    } UNION {
        VALUES ?p {rdf:type} .
        ?link ?p ?o .
        FILTER(CONTAINS(STR(?o), "http://dbpedia.org/ontology/")) .
    }
} GROUP BY ?p ?id ?pr ?link

Julien Plu
PhD Student, EURECOM
plu.julien@gmail.com | julien.plu@eurecom.fr
http://jplu.github.io
Campus SophiaTech
450 route des Chappes
06410 Biot, France
Phone: +33 (0) 4 93008103

Re: Querying TDB takes ages

Posted by Lorenz Buehmann <bu...@informatik.uni-leipzig.de>.


On 02.10.2017 12:06, Julien Plu wrote:
> UNION {
>         VALUES ?p {rdf:type} .
>         ?link ?p ?o .
>         FILTER(CONTAINS(STR(?o), "http://dbpedia.org/ontology/")) .
>     }

I would rewrite this. Provided that you have also loaded the DBpedia
ontology, you can use the following (with the owl: prefix declared as
<http://www.w3.org/2002/07/owl#>):

UNION {
 ?o a owl:Class .
 ?link rdf:type ?o .
}

Why this should work: I guess you simply want to exclude the classes
from YAGO and schema.org. The triple pattern

?o a owl:Class

filters those out, because the DBpedia ontology contains matching
triples only for the DBpedia classes.
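
As a rough illustration of the two filters, the following Python sketch runs both over a tiny, hypothetical set of triples. It is not Jena code; the resource, the YAGO-style class IRI, and the function names are made up for the example. It only shows that when the store declares owl:Class solely for DBpedia ontology classes (the assumption stated above), the string filter and the triple pattern keep the same types.

```python
DBO = "http://dbpedia.org/ontology/"
RDF_TYPE = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type"
OWL_CLASS = "http://www.w3.org/2002/07/owl#Class"

# Hypothetical sample data: one resource typed with a DBpedia class,
# a YAGO-style class and a schema.org class.
TRIPLES = {
    ("http://dbpedia.org/resource/Paris", RDF_TYPE, DBO + "City"),
    ("http://dbpedia.org/resource/Paris", RDF_TYPE, "http://dbpedia.org/class/yago/City"),
    ("http://dbpedia.org/resource/Paris", RDF_TYPE, "http://schema.org/City"),
    # Ontology triple: only the DBpedia class is declared as an owl:Class.
    (DBO + "City", RDF_TYPE, OWL_CLASS),
}


def types_by_string_filter(link):
    """Original branch: keep rdf:type values whose IRI contains the ontology namespace."""
    return {o for s, p, o in TRIPLES if s == link and p == RDF_TYPE and DBO in o}


def types_by_class_pattern(link):
    """Rewritten branch: keep rdf:type values that are themselves declared as owl:Class."""
    declared = {s for s, p, o in TRIPLES if p == RDF_TYPE and o == OWL_CLASS}
    return {o for s, p, o in TRIPLES if s == link and p == RDF_TYPE and o in declared}
```

The second function never inspects the text of an IRI; it only joins against the set of declared classes, which is the kind of work an indexed triple pattern can do without string comparison.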



Cheers,

Lorenz

Re: Querying TDB takes ages

Posted by Julien Plu <pl...@gmail.com>.

Thanks Rob for your quick reply!

hummm I see, what you are saying indeed makes sense, so what you propose is to have a query like this?

PREFIX dc: <http://purl.org/dc/elements/1.1/>
PREFIX foaf: <http://xmlns.com/foaf/0.1/>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
SELECT DISTINCT ?p (GROUP_CONCAT(DISTINCT ?o;separator="-----") AS ?vals) ?id ?pr ?link WHERE {
    {
        SELECT DISTINCT ?link (STR(?o3) AS ?id) (STR(?o2) AS ?pr) WHERE {
            ?link dbo:wikiPageRank ?o2 .
            ?link dbo:wikiPageID ?o3 .
            FILTER NOT EXISTS{?link dbo:wikiPageRedirects ?x} .
            FILTER NOT EXISTS{?link dbo:wikiPageDisambiguates ?y} .
        } LIMIT 1 OFFSET %offset
    }
    {
        ?link ?p ?o .
        FILTER(DATATYPE(?o) = xsd:string || LANG(?o) = "en") .
    } UNION {
        VALUES ?p {dbo:wikiPageRedirects dbo:wikiPageDisambiguates} .
        ?x ?p ?link .
        ?x rdfs:label ?o .
    } UNION {
        VALUES ?p {rdf:type} .
        ?link ?p ?o .
        FILTER(CONTAINS(STR(?o), "http://dbpedia.org/ontology/")) .
    }
} GROUP BY ?p ?id ?pr ?link



Julien Plu
PhD Student, EURECOM
plu.julien@gmail.com | julien.plu@eurecom.fr
http://jplu.github.io
Campus SophiaTech
450 route des Chappes
06410 Biot, France
Phone: +33 (0) 4 93008103

Re: Querying TDB takes ages

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Julien

At a glance, your query is very broad: it effectively selects the entire dataset and then applies string filters over the data, e.g. the CONTAINS filter.

This will force TDB to read pretty much the entire dataset on every single query. You may be better off moving the subquery with the LIMIT to the start of your query, as TDB can then probably use its single result to limit the amount of data it has to read to answer the rest of your query.
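
To make the reasoning above concrete, here is a minimal Python sketch. It is not Jena or TDB code, and it does not claim to reproduce how the TDB optimizer works. It assumes a hypothetical in-memory list of triples (TRIPLES) with a subject index (BY_SUBJECT), and compares how many triples are touched when the broad `?link ?p ?o` pattern is matched first with how many are touched when the single `?link` from the LIMIT 1 subquery is resolved first. All names and data are invented for illustration.

```python
from collections import defaultdict

# Hypothetical toy data standing in for DBpedia: (subject, predicate, object).
TRIPLES = [
    ("ex:Paris", "dbo:wikiPageID", "1"),
    ("ex:Paris", "rdfs:label", "Paris"),
    ("ex:Berlin", "dbo:wikiPageID", "2"),
    ("ex:Berlin", "rdfs:label", "Berlin"),
    ("ex:Rome", "dbo:wikiPageID", "3"),
    ("ex:Rome", "rdfs:label", "Rome"),
]

# Subject index, playing the role of a subject-first index in a triple store.
BY_SUBJECT = defaultdict(list)
for s, p, o in TRIPLES:
    BY_SUBJECT[s].append((s, p, o))


def pick_link(offset):
    """Stand-in for the LIMIT 1 OFFSET n subquery.

    The subjects are sorted only to make the toy deterministic. The cost of
    this step is the same in both strategies below, so it is not counted.
    """
    links = sorted(s for s, p, _ in TRIPLES if p == "dbo:wikiPageID")
    return links[offset]


def broad_first(offset):
    """Match `?link ?p ?o` over everything, then join with the single link."""
    touched = 0
    candidates = []
    for triple in TRIPLES:
        touched += 1
        candidates.append(triple)
    link = pick_link(offset)
    rows = [t for t in candidates if t[0] == link]
    return rows, touched


def selective_first(offset):
    """Resolve the single link first, then read only that subject's triples."""
    link = pick_link(offset)
    rows = BY_SUBJECT[link]
    return rows, len(rows)
```

With this toy data, broad_first(1) and selective_first(1) return the same rows, but the first touches all six triples while the second reads only the two triples of the selected subject. On a dataset the size of DBpedia the gap would be correspondingly larger.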

Rob
