Posted to users@jena.apache.org by Kunal Jain <ku...@ezeiatech.com> on 2012/04/19 16:42:04 UTC

Filter regex queries are very slow

Dear All,

I am developing an application using Jena with a TDB store. I have
loaded around 4 million triples into the store. A small subset of the triples
is as follows:

<http://mycompany/ontolgy/ext/#AD> <http://mycompany/ontology#name> "AD" .
<http://mycompany/ontolgy/ext/#AD> <http://mycompany/ontology#type> "category" .
<http://mycompany/ontolgy/ext/#AD1003> <http://mycompany/ontology#name> "AD100" .
<http://mycompany/ontolgy/ext/#AD1003> <http://mycompany/ontology#type> "product" .
<http://mycompany/ontolgy/ext/#AD1003> <http://mycompany/ontology#belongsTo> <http://mycompany/ontolgy/ext/#AD> .
<http://mycompany/ontolgy/ext/#light1> <http://mycompany/ontology#name> "Light" .
<http://mycompany/ontolgy/ext/#light1> <http://mycompany/ontology#type> "item" .
<http://mycompany/ontolgy/ext/#light1> <http://mycompany/ontology#belongsTo> <http://mycompany/ontolgy/ext/#AD1003> .
<http://mycompany/ontolgy/ext/#light1> <http://mycompany/ontology#value> "42.5833" .

Now I want to do free-text matching for an autosuggest feature. I run the
following query against my store:

    PREFIX gs: <http://mycompany/ontolgy/ext/#>
    PREFIX vs: <http://mycompany/ontology#>

    SELECT ?subjectName
    WHERE {
      ?subject vs:name ?subjectName
      FILTER regex(?subjectName, "^Light", "i")
    } LIMIT 10

With this query I am trying to find names that start with a particular word,
here 'Light'. The query takes around 20 seconds to execute.

I am using Jena core 2.7, ARQ 2.9 and TDB 0.9.

I need help figuring out how this can be optimized.

Thanks in advance.

Regards,
Kunal

  _____  

No virus found in this message.
Checked by AVG - www.avg.com
Version: 2012.0.1913 / Virus Database: 2411/4946 - Release Date: 04/19/12


Re: Filter regex queries are very slow

Posted by Paolo Castagna <ca...@googlemail.com>.
Kunal Jain wrote:
> Hi,
> 
> Thanks for the information. I am trying to integrate LARQ into the
> application.
> Damian, have you released the latest version? The version currently
> available is a bit old.

My fault. I am reading the documentation right now to make sure I follow the
correct process and push the right buttons.

Paolo

> 
> Thanks Again
> 
> Kunal
> 
> -----Original Message-----
> From: Paolo Castagna [mailto:castagna.lists@googlemail.com] 
> Sent: 20 April 2012 00:25
> To: jena-users@incubator.apache.org
> Subject: Re: Filter regex queries are very slow
> 
> Hi
> 
> Damian Steer wrote:
>> LARQ adds a proper free text index [1] which should be much better. This
>> is now a separate module, I believe (Paolo?).
>> Personally I've used a separate trie index for autocompletion, since it
>> tends to get hammered.
> 
> Correct, LARQ is a separate module now (and I still need to finalize the
> 1.0.0 release which has been voted and approved). Apologies, I have not done
> it yet because it's my first time and I do not want to make mistakes (I
> still need to re-read the documentation).
> 
> I'll do this over the week-end at the latest.
> 
> Cheers,
> Paolo
> 
>> [1] <http://incubator.apache.org/jena/documentation/larq/index.html>
> 


RE: Filter regex queries are very slow

Posted by Kunal Jain <ku...@ezeiatech.com>.
Hi,

Thanks for the information. I am trying to integrate LARQ into the
application.
Damian, have you released the latest version? The version currently
available is a bit old.

Thanks Again

Kunal

-----Original Message-----
From: Paolo Castagna [mailto:castagna.lists@googlemail.com] 
Sent: 20 April 2012 00:25
To: jena-users@incubator.apache.org
Subject: Re: Filter regex queries are very slow

Hi

Damian Steer wrote:
> LARQ adds a proper free text index [1] which should be much better. This
> is now a separate module, I believe (Paolo?).
> Personally I've used a separate trie index for autocompletion, since it
> tends to get hammered.

Correct, LARQ is a separate module now (and I still need to finalize the
1.0.0 release which has been voted and approved). Apologies, I have not done
it yet because it's my first time and I do not want to make mistakes (I
still need to re-read the documentation).

I'll do this over the week-end at the latest.

Cheers,
Paolo

> [1] <http://incubator.apache.org/jena/documentation/larq/index.html>



Re: Filter regex queries are very slow

Posted by Paolo Castagna <ca...@googlemail.com>.
Hi

Damian Steer wrote:
> LARQ adds a proper free text index [1] which should be much better. This is now a separate module, I believe (Paolo?).
> Personally I've used a separate trie index for autocompletion, since it tends to get hammered.

Correct, LARQ is a separate module now (and I still need to finalize the 1.0.0
release which has been voted and approved). Apologies, I have not done it yet
because it's my first time and I do not want to make mistakes (I still need to
re-read the documentation).

I'll do this over the week-end at the latest.

Cheers,
Paolo

> [1] <http://incubator.apache.org/jena/documentation/larq/index.html>


Re: Filter regex queries are very slow

Posted by Damian Steer <d....@bristol.ac.uk>.
On 19 Apr 2012, at 16:48, Rob Vesse wrote:

> Trie indexes are the simplest way to do prefix searches for auto-completion, but they require you to do all the implementation yourself because AFAIK we don't have a drop-in trie index module for ARQ.
> 
> Damian - Is your Trie index code something you could share?
> 
> Rob


The trie is part of a simple autocompleter web application that talks to SPARQL endpoints, rather than an ARQ-accessible index. The trie code isn't mine, and the original seems to have vanished from the web.

I could put it up on github if anyone is interested.

Damian

Re: Filter regex queries are very slow

Posted by Rob Vesse <ra...@ecs.soton.ac.uk>.
+1 to both suggestions

LARQ has the benefit that it gives you proper full text search, so you
can do more than simple auto-completion (rankings, result limits, full
Lucene queries), and that it is a relatively easy drop-in module.
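
For the autosuggest case, a LARQ lookup would replace the FILTER regex with
the pf:textMatch property function. The following is only a rough sketch that
builds the query text; it has not been run against a particular LARQ release.
The class and method names (LarqPrefixQuery, build) are made up for
illustration, and it assumes a LARQ index over the literals has already been
built and registered as described in the LARQ documentation. Note that a
Lucene prefix term such as light* matches any indexed word starting with
"light", so this is looser than the anchored "^Light" regex.

```java
import java.util.Locale;

// Hypothetical helper (not part of Jena or LARQ): builds a SPARQL query that
// uses LARQ's pf:textMatch property function for a single-word prefix search.
final class LarqPrefixQuery {

    // Returns the query text, or null when the input is not a single word made
    // of letters and digits. Rejecting everything else keeps user input from
    // injecting Lucene syntax or breaking out of the SPARQL string literal.
    static String build(String userInput, int limit) {
        // Assumption: the index stores lower-cased terms, so lower-case the input.
        String term = userInput.trim().toLowerCase(Locale.ROOT);
        if (term.isEmpty() || !term.chars().allMatch(Character::isLetterOrDigit)) {
            return null;
        }
        return String.join("\n",
            "PREFIX vs: <http://mycompany/ontology#>",
            "PREFIX pf: <http://jena.hpl.hp.com/ARQ/property#>",
            "SELECT ?subjectName",
            "WHERE {",
            "  ?subjectName pf:textMatch '" + term + "*' .",
            "  ?subject vs:name ?subjectName",
            "} LIMIT " + limit);
    }

    public static void main(String[] args) {
        System.out.println(build("Light", 10));
    }
}
```

The returned string would then be executed through ARQ in the usual way, once
the index is in place.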

Trie indexes are the simplest way to do prefix searches for
auto-completion, but they require you to do all the implementation
yourself because AFAIK we don't have a drop-in trie index module for ARQ.
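
To give an idea of how little code a basic trie needs, here is a minimal
sketch of a case-insensitive prefix trie. It is not the trie code discussed in
this thread; the class name PrefixTrie and its methods are invented for the
example, and it assumes the names have been loaded into memory beforehand (for
instance by one SELECT over vs:name at startup).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Minimal case-insensitive prefix trie for autocompletion (illustrative only).
final class PrefixTrie {

    private static final class Node {
        final Map<Character, Node> children = new TreeMap<>();
        // Original spellings of the names that end exactly at this node.
        final List<String> names = new ArrayList<>();
    }

    private final Node root = new Node();

    // Index a name under its lower-cased form so lookups ignore case.
    void add(String name) {
        Node node = root;
        for (char c : name.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.children.computeIfAbsent(c, k -> new Node());
        }
        node.names.add(name);
    }

    // Return up to 'limit' names that start with 'prefix', ignoring case.
    List<String> complete(String prefix, int limit) {
        Node node = root;
        for (char c : prefix.toLowerCase(Locale.ROOT).toCharArray()) {
            node = node.children.get(c);
            if (node == null) {
                return new ArrayList<>();
            }
        }
        List<String> out = new ArrayList<>();
        collect(node, limit, out);
        return out;
    }

    // Depth-first walk; TreeMap keeps children sorted, so results come out in
    // lexicographic order of their lower-cased form.
    private static void collect(Node node, int limit, List<String> out) {
        for (String name : node.names) {
            if (out.size() >= limit) {
                return;
            }
            out.add(name);
        }
        for (Node child : node.children.values()) {
            if (out.size() >= limit) {
                return;
            }
            collect(child, limit, out);
        }
    }

    public static void main(String[] args) {
        PrefixTrie trie = new PrefixTrie();
        for (String name : new String[] {"AD", "AD100", "Light"}) {
            trie.add(name);
        }
        System.out.println(trie.complete("ad", 10));
    }
}
```

The lookup cost depends on the prefix length and the number of results
returned, not on the total number of names, which is why this holds up when
the autocompleter is hit on every keystroke.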

Damian - Is your Trie index code something you could share?

Rob

On 4/19/12 8:12 AM, Damian Steer wrote:
> On 19 Apr 2012, at 15:42, Kunal Jain wrote:
>
>> Dear All,
> Hi there,
>
>> I am developing an application using Jena along with TDB store. I have
>> loaded around 4 million triples in my store. A small subset of my triples is
>> as follows:
> ...
>
>> Now I want to do a free text matching for autosuggest kind functionality. I
>> have got this query to run against my store
>>                              ?subject vs:name ?subjectName
>>
>>                             FILTER regex(?subjectName, "^Light", "i")
>> In this query I am trying to find triples which start with a particular word,
>> i.e. starting with 'Light'. Execution of this query is taking around 20
>> seconds.
> While it's possible this could be improved, it's never going to be great since the search is unindexed.
>
> LARQ adds a proper free text index [1] which should be much better. This is now a separate module, I believe (Paolo?).
> Personally I've used a separate trie index for autocompletion, since it tends to get hammered.
>
> Damian
>
> [1]<http://incubator.apache.org/jena/documentation/larq/index.html>


Re: Filter regex queries are very slow

Posted by Damian Steer <d....@bristol.ac.uk>.
On 19 Apr 2012, at 15:42, Kunal Jain wrote:

> Dear All,

Hi there,

> I am developing an application using Jena along with TDB store. I have
> loaded around 4 million triples in my store. A small subset of my triples is
> as follows:

...

> Now I want to do a free text matching for autosuggest kind functionality. I
> have got this query to run against my store

>                             ?subject vs:name ?subjectName
> 
>                            FILTER regex(?subjectName, "^Light", "i")

> In this query I am trying to find triples which start with a particular word,
> i.e. starting with 'Light'. Execution of this query is taking around 20
> seconds.

While it's possible this could be improved, it's never going to be great since the search is unindexed.
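
To illustrate what "unindexed" costs, here is a small self-contained sketch.
It is not Jena code and the names (PrefixLookupDemo, scan, rangeLookup) are
invented for the example. The first method mimics what a FILTER regex
effectively has to do, testing the pattern against every candidate name; the
second keeps the names in a sorted map keyed by lower-cased name, so all names
sharing a prefix sit in one contiguous range. It assumes names never contain
the character U+FFFF, which is used as the upper bound of that range.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.regex.Pattern;

// Illustration only: contrasts an unindexed regex scan with a sorted lookup.
final class PrefixLookupDemo {

    // Unindexed: every name is tested against the pattern, so the cost grows
    // with the total number of names, however few of them match.
    static List<String> scan(List<String> names, String prefix) {
        Pattern p = Pattern.compile("^" + Pattern.quote(prefix), Pattern.CASE_INSENSITIVE);
        List<String> out = new ArrayList<>();
        for (String name : names) {
            if (p.matcher(name).find()) {
                out.add(name);
            }
        }
        return out;
    }

    // Indexed: keys are lower-cased names in sorted order, so the matches form
    // one contiguous key range that the map can locate directly.
    static List<String> rangeLookup(NavigableMap<String, String> byLowerCase, String prefix) {
        String from = prefix.toLowerCase(Locale.ROOT);
        String to = from + Character.MAX_VALUE; // just past every key with this prefix
        return new ArrayList<>(byLowerCase.subMap(from, true, to, false).values());
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("AD", "AD100", "Light");
        NavigableMap<String, String> index = new TreeMap<>();
        for (String n : names) {
            index.put(n.toLowerCase(Locale.ROOT), n);
        }
        System.out.println(scan(names, "light"));
        System.out.println(rangeLookup(index, "light"));
    }
}
```

Both methods return the same names for a given prefix; only the amount of work
per lookup differs.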

LARQ adds a proper free text index [1] which should be much better. This is now a separate module, I believe (Paolo?).
Personally I've used a separate trie index for autocompletion, since it tends to get hammered.

Damian

[1] <http://incubator.apache.org/jena/documentation/larq/index.html>