Posted to users@jena.apache.org by "Paton, Diego" <di...@teamaol.com> on 2014/03/07 12:22:35 UTC

sparql query performance

Hi,

I am working with the Freebase ontology stored in Apache Jena TDB and executing queries using Fuseki.

What I want to retrieve is the mID, entity name, description and, optionally, the Wikipedia URL if present (I expect to obtain more than 6M results). The problem is that the query takes more than 24h to run.


prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?mID ?e ?nf ?desc  ?wikipedia_url
where
{
    {
       ?mID fb:type.object.type fb:people.person .
       ?mID fb:type.object.name ?e .
       ?mID fb:common.topic.notable_for ?notab_for .
       ?notab_for fb:common.notable_for.display_name ?nf .
       ?mID fb:common.topic.description ?desc .

       OPTIONAL
       {
          ?mID fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
          FILTER (regex (str(?wikipedia_url), "en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
       }

       FILTER (langMatches(lang(?e), "en") && langMatches(lang(?nf), "en") && langMatches(lang(?desc), "en"))
    }

}

ORDER BY ?mID


The modified query below, with the OPTIONAL clause removed, takes 3h.

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?mID ?e ?nf ?desc
where
{
    {
       ?mID fb:type.object.type fb:people.person .
       ?mID fb:type.object.name ?e .
       ?mID fb:common.topic.notable_for ?notab_for .
       ?notab_for fb:common.notable_for.display_name ?nf .
       ?mID_raw fb:common.topic.description ?desc .

       FILTER (langMatches(lang(?e), "en") && langMatches(lang(?nf), "en") && langMatches(lang(?desc), "en"))
    }
    BIND(REPLACE(str(?mID_raw), "http://rdf.freebase.com/ns/", "") as ?mID)
}

ORDER BY ?mID

And the modified query below, with the filter removed from the OPTIONAL clause, takes more than 20h (still running).


prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?mID ?e ?nf ?desc  ?wikipedia_url
where
{
    {
       ?mID fb:type.object.type fb:people.person .
       ?mID fb:type.object.name ?e .
       ?mID fb:common.topic.notable_for ?notab_for .
       ?notab_for fb:common.notable_for.display_name ?nf .
       ?mID fb:common.topic.description ?desc .

       OPTIONAL
       {
          ?mID fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
       }

       FILTER (langMatches(lang(?e), "en") && langMatches(lang(?nf), "en") && langMatches(lang(?desc), "en"))
    }

}

ORDER BY ?mID

Do you have any ideas about how to improve the performance of the first query, which is the one that meets my requirements?

Regards,

Diego.

Re: sparql query performance

Posted by Rob Vesse <rv...@dotnetrdf.org>.

Diego

Firstly, do you always need to get all the results at once?

One possible option is to add LIMIT and OFFSET clauses.  If you do this
then ARQ may apply the Top N sorting optimisation; it does if you add a
LIMIT to the query you gave in your original email.

When this is used, the query engine doesn't have to hold all the
intermediate solutions in memory prior to sorting, as it does with your
current query.  With the Top N sorting optimisation it only remembers the
top N results it has seen, evicting results that aren't in the top N as it
goes along, which will reduce memory usage substantially and possibly
speed things up.  However, it still has to calculate all intermediate
results, so it may not make much difference.
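
As a rough sketch of the paging idea above, the query could be run one
page at a time.  The page size of 1000 and the starting offset are
assumptions chosen for illustration, not values from this thread, and
whether ARQ actually applies the Top N optimisation for a given LIMIT
should be checked against your version.

```sparql
prefix fb: <http://rdf.freebase.com/ns/>

select ?mID ?e ?nf ?desc
where
{
    ?mID fb:type.object.type fb:people.person .
    ?mID fb:type.object.name ?e .
    ?mID fb:common.topic.notable_for ?notab_for .
    ?notab_for fb:common.notable_for.display_name ?nf .
    ?mID fb:common.topic.description ?desc .
    FILTER (langMatches(lang(?e), "en") && langMatches(lang(?nf), "en") && langMatches(lang(?desc), "en"))
}
ORDER BY ?mID
# Page size and offset are illustrative assumptions;
# advance OFFSET by the page size for each request.
LIMIT 1000
OFFSET 0
```

Each later page re-runs the pattern, so this trades one very long query
for many shorter ones rather than removing the work.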

Secondly, your filter expressions are quite expensive since you are
applying regular expressions to every possible solution to filter them
down.

Since you appear to be doing simple substring checks with your regular
expressions, I would start by converting them to the CONTAINS() function
instead.  You can use CONTAINS(LCASE(str(?var)), "text") if you want a
case-insensitive match.  See
http://www.w3.org/TR/sparql11-query/#func-contains
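
A minimal sketch of that conversion, applied to the OPTIONAL block from
the original query, might look like the following.  The surrounding
pattern is cut down to a single triple purely to keep the example short;
that trimming is an assumption for illustration, not a suggested change.

```sparql
prefix fb: <http://rdf.freebase.com/ns/>

select ?mID ?wikipedia_url
where
{
    ?mID fb:type.object.type fb:people.person .
    OPTIONAL
    {
        ?mID fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
        # LCASE stands in for the case-insensitive "i" flag of regex.
        # Unlike regex, CONTAINS treats "." as a literal dot.
        FILTER (CONTAINS(LCASE(str(?wikipedia_url)), "en.wikipedia") && !CONTAINS(LCASE(str(?wikipedia_url)), "curid="))
    }
}
```

Because the dot is now matched literally, this is slightly stricter than
the regex version, which is probably what was intended.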

I would also take a look at using the jena-text module
(http://jena.apache.org/documentation/query/text-query.html) and building
a free text index which can be used in your query instead of regular
expressions and would likely substantially improve performance.

Rob

Re: sparql query performance

Posted by Andy Seaborne <an...@apache.org>.

On 11/03/14 11:45, Paton, Diego wrote:
> Hi,
>
> The problem is the ORDER BY clause. I am expecting to obtain 8M results, and that makes the ordering very slow (more than 26h). Without ordering it takes less than 3h.
>
> I need the result set ordered by ?mID because I can't order it in a post-processing step due to memory problems.
>
> Do you know if it is possible to improve the performance of the query with the ORDER BY clause?

Hi Diego,

What hardware are you running on (how much RAM does the machine have),
and how much heap space does the JVM have?  One cause of slow sorts I've
seen before is the machine swapping.

There is some support for an external sort (i.e. using temporary files).
See the javadoc for ARQ.spillToDiskThreshold.  That may help, especially
if you are running into query engine memory size problems, but also
generally.  But it's pragmatic - it may be slower.

	Andy

Re: sparql query performance

Posted by "Paton, Diego" <di...@teamaol.com>.

Hi,

The problem is the ORDER BY clause. I am expecting to obtain 8M results, and that makes the ordering very slow (more than 26h). Without ordering it takes less than 3h.

I need the result set ordered by ?mID because I can't order it in a post-processing step due to memory problems.

Do you know if it is possible to improve the performance of the query with the ORDER BY clause?


> prefix fb: <http://rdf.freebase.com/ns/>
> prefix fn: <http://www.w3.org/2005/xpath-functions#>
> select ?mID ?e ?nf ?desc  ?wikipedia_url
> where
> {
>   {
>       ?mID fb:type.object.type fb:people.person .
>       ?mID fb:type.object.name ?e .
>       ?mID fb:common.topic.notable_for ?notab_for .
>       ?notab_for fb:common.notable_for.display_name ?nf .
>       ?mID fb:common.topic.description ?desc .
>       FILTER (langMatches(lang(?e), "en") && langMatches(lang(?nf), "en") && langMatches(lang(?desc), "en"))
>    }
> 
>    OPTIONAL
>    {
>      ?mID fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
>      FILTER (regex (str(?wikipedia_url), "en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
>    }
> }
> 
> ORDER BY ?mID


Regards,

Diego.


Re: sparql query performance

Posted by Andy Seaborne <an...@apache.org>.

You can try forcing the scope of the filter to be like your second query 
then do the optional part:

prefix fb: <http://rdf.freebase.com/ns/>
prefix fn: <http://www.w3.org/2005/xpath-functions#>
select ?mID ?e ?nf ?desc  ?wikipedia_url
where
{
    {
        ?mID fb:type.object.type fb:people.person .
        ?mID fb:type.object.name ?e .
        ?mID fb:common.topic.notable_for ?notab_for .
        ?notab_for fb:common.notable_for.display_name ?nf .
        ?mID fb:common.topic.description ?desc .
        FILTER (langMatches(lang(?e), "en") && langMatches(lang(?nf), "en") && langMatches(lang(?desc), "en"))
     }

     OPTIONAL
     {
       ?mID fb:common.topic.topic_equivalent_webpage ?wikipedia_url .
       FILTER (regex (str(?wikipedia_url), "en.wikipedia", "i") && !regex (str(?wikipedia_url), "curid=", "i")) .
     }
}

which may be more like the 3h query.

There are improvements in progress for this, but they haven't reached 
TDB yet.

The hardware you are running on will be a big factor.

	Andy


