Posted to user@lucenenet.apache.org by Floris van Gog <F....@xindao.nl> on 2012/12/27 20:40:42 UTC

How to? speed up wildcard queries

Hello,

With a few examples taken from blogs (which I no longer remember, but if one was yours, thanks!) I have managed to get Lucene.Net working for a small search engine web service used behind a website. I also added some homemade faceting to it (similar to what Solr offers, I guess, but not as elaborate). The reason to roll my own was the filtering requirements on pricing (various price lists and brackets) and stock level (future stock).

Even though it works, it is far from optimal (in my eyes), and most of the pain is in the wildcard queries. Because the searcher has to help customers find products, every term in a search query is automatically prefixed and suffixed with a *. Leaving these out seriously limits the usefulness of the free-text search. This is a business requirement.

[The search uses RAMDirectory storage, and the tests below are always run sequentially on a single CPU. Documents are never removed from the index.]
The trailing * is still somewhat OK: I can do about 800 searches/second on a 1500-document index. The documents contain little text (a short description, maybe 2-3 lines).
However, adding the leading * makes the search throughput drop to about 100 searches/second.
To put this in perspective: with no wildcards I get about 4000 searches/second, and if I only use facets to filter, about 60,000 searches/second.

The query is a manually built BooleanQuery containing WildcardQuery clauses on 2 fields of the document, combined with SHOULD.

Is there a way to speed up wildcard queries with a leading * somehow? I am currently thinking of adding a field to the document with the text reversed, and only applying a trailing * wildcard to it. Theoretically this should give me about 400 searches/second.
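
The reversed-field idea can be sketched in plain Python (this is not Lucene.Net code; the function names and the sample product terms are hypothetical, chosen only for illustration). A sorted term list stands in for the term dictionary: a trailing-wildcard lookup can jump straight to the matching range, and storing each term reversed turns a "*pen" lookup into a prefix lookup for "nep".

```python
from bisect import bisect_left


def prefix_matches(sorted_terms, prefix):
    """Return all terms starting with prefix, using binary search on a sorted list."""
    start = bisect_left(sorted_terms, prefix)
    matches = []
    for term in sorted_terms[start:]:
        if not term.startswith(prefix):
            break  # terms sharing a prefix are contiguous in sorted order
        matches.append(term)
    return matches


def build_reversed_terms(terms):
    """Hypothetical stand-in for the extra field: every term reversed, then sorted."""
    return sorted(term[::-1] for term in terms)


def suffix_matches(reversed_terms, suffix):
    """Answer '*suffix' as a prefix lookup on the reversed terms."""
    hits = prefix_matches(reversed_terms, suffix[::-1])
    return sorted(hit[::-1] for hit in hits)


# Made-up sample terms, not taken from the original index.
terms = ["ballpen", "notebook", "pen", "pencil", "umbrella"]
reversed_terms = build_reversed_terms(terms)
print(suffix_matches(reversed_terms, "pen"))  # ['ballpen', 'pen']
```

Note that this only covers a wildcard at one end of the term. A term wrapped in wildcards on both sides ("*pen*") still has a leading wildcard after reversal, so the reversed field alone does not avoid the full term scan in that case.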

Any input is appreciated,
Floris

Re: How to? speed up wildcard queries

Posted by Noel Lysaght <ly...@outlook.com>.

You're not really looking for a wildcard query. I would think you need to generate an index that contains every possible substring of each word. For example, take the word "small". You need to index that as small, sm, sma, smal, ma, mal, mall, al, all, etc.
I'm pretty sure there is a sample analyzer/tokenizer that can do this. You end up with a bigger index but a lot more power for your searches.
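
A rough sketch of that substring indexing in plain Python (not the Lucene.Net analyzer itself; the function names, the sample documents and the minimum length of 2 are assumptions made for illustration). Every substring becomes an ordinary term, so a "*mall*" search turns into an exact lookup of "mall".

```python
def substrings(word, min_len=2):
    """Return every contiguous substring of word with at least min_len characters."""
    grams = set()
    for start in range(len(word)):
        for end in range(start + min_len, len(word) + 1):
            grams.add(word[start:end])
    return grams


def build_index(docs, min_len=2):
    """Map each substring to the set of ids of the documents that contain it."""
    index = {}
    for doc_id, text in docs.items():
        for word in text.lower().split():
            for gram in substrings(word, min_len):
                index.setdefault(gram, set()).add(doc_id)
    return index


# Made-up sample documents, not taken from the original index.
docs = {1: "Small notebook", 2: "Ballpen blue", 3: "Mall umbrella"}
index = build_index(docs)
print(sorted(index.get("mall", set())))  # [1, 3]
```

The number of substrings grows roughly with the square of the word length, which is where the bigger index comes from. With only about 1500 short documents that is likely acceptable, but it is worth measuring.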

Cheers
Noel



