You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by Clemens Wyss DEV <cl...@mysign.ch> on 2012/12/07 10:16:04 UTC

Alternative for WildcardQuery with leading *

In order to provide suggestions our query also includes a "WildcardQuery with a leading *", which, of course, has a HUGE performance impact :-(

E.g.
Say we have indexed "vacancyplan", then if a user typed "plan" he should also be offered "vacancyplan" ...

How can this feature be implemented without WcQwl* ?

Any help appreciated
- Clemens

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Alternative for WildcardQuery with leading *

Posted by Oliver Christ <oc...@ebscohost.com>.
If I remember correctly it was Baeza-Yates or someone in his group at U Santiago who came up with the rotated term indexing. 

Indexing "abc", you explicitly mark end of string and index all rotations using a data structure which supports prefix search (such as a trie):

abc$
bc$a
c$ab
$abc

This index directly supports all prefix, suffix, circumfix, and infix wildcard searches (a*, *c, a*c, *b*) by transforming the query:

a* 	-> $a		(all terms starting with a)
a*c 	-> c$a	(all terms starting with a and ending in c)
*c	-> c$		(all terms ending with c)
*b*	-> b		(all terms containing b)

You then simply run a prefix search for the transformed query in the trie of rotated strings. Quite similar to suffix arrays, but adds circumfix support (a*c) which, if I remember correctly, are not directly supported by standard suffix arrays. One can mark up start and end of the indexed term if the same trie should also be used for exact match ($abc$).

Cheers, Oli


-----Original Message-----
From: Uwe Schindler [mailto:uwe@thetaphi.de] 
Sent: Friday, December 07, 2012 4:34 AM
To: java-user@lucene.apache.org
Subject: RE: Alternative for WildcardQuery with leading *

In general, you seem to need decomposing...: vacancyplan -> tokenized to -> vacancyplan, vacancy, plan. Wildcards are in general not really a replacement for correct text analysis on the indexing side. Unfortunately, decomposing is a hard task, but there are dictionary-based algorithms for e.g. German.

About the question with wildcards: If you really need a wildcard  in front, the common use case is to also index all terms in backwards order. E.g. Solr does this with ReverseWildCardTokenFilter, which indexes all terms forward and also in reverse order with some extra "marker-prefix" on the reverse terms (to be able to differentiate between forwards and backwards terms in the same field - of course, you could also index the reverse terms into an extra field). The query parser then automatically produces a simple PrefixQuery with a transformed query term (so it queries the magic extra reverse terms with a prefix).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
> Sent: Friday, December 07, 2012 10:16 AM
> To: java-user@lucene.apache.org
> Subject: Alternative for WildcardQuery with leading *
> 
> In order to provide suggestions our query also includes a 
> "WildcardQuery with a leading *", which, of course, has a HUGE 
> performance impact :-(
> 
> E.g.
> Say we have indexed "vacancyplan", then if a user typed "plan" he 
> should also be offered "vacancyplan" ...
> 
> How can this feature be implemented without WcQwl* ?
> 
> Any help appreciated
> - Clemens
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Alternative for WildcardQuery with leading *

Posted by Uwe Schindler <uw...@thetaphi.de>.
In general, you seem to need decomposing...: vacancyplan -> tokenized to -> vacancyplan, vacancy, plan. Wildcards are in general not really a replacement for correct text analysis on the indexing side. Unfortunately, decomposing is a hard task, but there are dictionary-based algorithms for e.g. German.

About the question with wildcards: If you really need a wildcard  in front, the common use case is to also index all terms in backwards order. E.g. Solr does this with ReverseWildCardTokenFilter, which indexes all terms forward and also in reverse order with some extra "marker-prefix" on the reverse terms (to be able to differentiate between forwards and backwards terms in the same field - of course, you could also index the reverse terms into an extra field). The query parser then automatically produces a simple PrefixQuery with a transformed query term (so it queries the magic extra reverse terms with a prefix).

Uwe

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Clemens Wyss DEV [mailto:clemensdev@mysign.ch]
> Sent: Friday, December 07, 2012 10:16 AM
> To: java-user@lucene.apache.org
> Subject: Alternative for WildcardQuery with leading *
> 
> In order to provide suggestions our query also includes a "WildcardQuery
> with a leading *", which, of course, has a HUGE performance impact :-(
> 
> E.g.
> Say we have indexed "vacancyplan", then if a user typed "plan" he should
> also be offered "vacancyplan" ...
> 
> How can this feature be implemented without WcQwl* ?
> 
> Any help appreciated
> - Clemens
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


AW: Alternative for WildcardQuery with leading *

Posted by Clemens Wyss DEV <cl...@mysign.ch>.
> Really off the top of my head, if that's an expected query, 
>you can try to index the words backwards (in that field) and 
>then convert the query *plan to nalp* :).
"interesting" approach ... I might give it a try :-) ... no kidding ;)

-----Ursprüngliche Nachricht-----
Von: Shai Erera [mailto:serera@gmail.com] 
Gesendet: Freitag, 7. Dezember 2012 10:20
An: java-user@lucene.apache.org
Betreff: Re: Alternative for WildcardQuery with leading *

Really off the top of my head, if that's an expected query, you can try to index the words backwards (in that field) and then convert the query *plan to nalp* :).

You can also index the suffixes of words, e.g. vacancyplan, acancyplan, cancyplan and so forth, and then convert the query *plan to plan. Note that it increases the lexicon !

Shai

On Fri, Dec 7, 2012 at 11:16 AM, Clemens Wyss DEV <cl...@mysign.ch>wrote:

> In order to provide suggestions our query also includes a 
> "WildcardQuery with a leading *", which, of course, has a HUGE 
> performance impact :-(
>
> E.g.
> Say we have indexed "vacancyplan", then if a user typed "plan" he 
> should also be offered "vacancyplan" ...
>
> How can this feature be implemented without WcQwl* ?
>
> Any help appreciated
> - Clemens
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Alternative for WildcardQuery with leading *

Posted by Shai Erera <se...@gmail.com>.
Really off the top of my head, if that's an expected query, you can try to
index the words backwards (in that field) and then convert the query *plan
to nalp* :).

You can also index the suffixes of words, e.g. vacancyplan, acancyplan,
cancyplan and so forth, and then convert the query *plan to plan. Note that
it increases the lexicon !

Shai

On Fri, Dec 7, 2012 at 11:16 AM, Clemens Wyss DEV <cl...@mysign.ch>wrote:

> In order to provide suggestions our query also includes a "WildcardQuery
> with a leading *", which, of course, has a HUGE performance impact :-(
>
> E.g.
> Say we have indexed "vacancyplan", then if a user typed "plan" he should
> also be offered "vacancyplan" ...
>
> How can this feature be implemented without WcQwl* ?
>
> Any help appreciated
> - Clemens
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>