You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by mick l <ka...@gmail.com> on 2008/06/24 14:28:53 UTC

Wildcard and Literal Searches combined

Folks,
My users require wildcard searches. Sometimes their search phrases contain
spaces. I am having trouble trying to implement a wildcard search on strings
containing spaces, so if the term includes spaces I force a literal search
by adding double quotes to the search term.
So the search string for 'Dublin' becomes search term (Dublin*)
whereas search string 'Dublin City' becomes ("Dublin City")


If I use (Dublin City*) I get all instances of Dublin OR City in the results
which is not what I am looking for. 

Is there any way I can combine the wildcard search and the literal?

Heres my existing code. Its in c# with Lucene.Net

//if input has spaces we do a literal search
if (sSearchQuery.IndexOf(" ") < 0)
{
sSearchQuery = "(" + sSearchQuery + "*)";
}
else
{
sSearchQuery = "(\"" + sSearchQuery + "\")";
}
IndexSearcher searcher = new IndexSearcher(sIndexLocation);
Hits oHitColl = searcher.Search(oParser.Parse(sSearchQuery));

Thanks folks 
-- 
View this message in context: http://www.nabble.com/Wildcard-and-Literal-Searches-combined-tp18089950p18089950.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Wildcard and Literal Searches combined

Posted by Jon Loken <Jo...@bipsolutions.com>.
Our approach is:
If the keyword is a single word, then append the star (e.g. replace ->
replace*)
If the keyword is a phrase containing one of more spaces, then it is
treated as an exact phrase (e.g. replace this -> 'replace this') 

Regards
Jon 

-----Original Message-----
From: mick l [mailto:kali.mist@gmail.com] 
Sent: 24 June 2008 14:22
To: java-user@lucene.apache.org
Subject: RE: Wildcard and Literal Searches combined


Cheers,
Out of interest, how did you go about it in the end?


jloken wrote:
> 
> Hi,
> 
> I posed a similar question on 09 May 08. 
> 
> The response was as below. I did not go down this route however, as a 
> wild carded 'exact' phrase is in a way contradictory.
> Regards
> Jon
> 
> 
> 
> Hi,
> 
> Here's a searchable mailing list archive: 
> http://www.gossamer-threads.com/lists/lucene/java-user/
> 
> As regards the wildcard phrase queries, here's one way I think you 
> could do it, but it's a bit of extra work. If you're using 
> QueryParser, you'd have to override the "getFieldQuery" method to use 
> span queries instead of phrase queries.
> 
> A phrase query can be implemented as a span query with a span or
"slop"
> 
> factor of 1. So, once you have the PhraseQuery object, you would:
> 
> 1. Extract the terms
> 2. For each one, check if it contains a "*" or a "?"
> 3. If it does, create a WildcardQuery using that term, and re-write it

> using IndexReader.rewrite method. This expands the wildcard query into

> all it's matches.
> 4. Create an array of SpanTermQuery objects, (one SpanTermQuery for 
> each term that matched you wildcard); then add that array to a
SpanOrQuery.
> 5. Repeat 2 to 4 for each wildcard term in the phrase.
> 6. Finally (!), create a SpanNearQuery, adding all the original terms 
> in order, but substituting your SpanOrQuerys for the wildcard terms. 
> Use a slop of 1, and set the inOrder flag to true.
> 
> So, essentially, you'd end up with: (you'll have to excuse me if I 
> haven't rendered the span queries correctly as strings here - but this

> should give the general idea...)
> 
> spanNear[boiler (spanOr[replacement replacing])]
> 
> So it will accept *either* "replacement" or "replacing" adjacent to 
> "boiler", which is what you want.
> 
> As you can see, it's a bit of work - but if you add this functionality

> to the QueryParser, you'll can re-use it a lot!
> 
> Hope that helps!
> 
> -JB
> 
> 
> 
> 
>  
> 
> -----Original Message-----
> From: mick l [mailto:kali.mist@gmail.com]
> Sent: 24 June 2008 13:29
> To: java-user@lucene.apache.org
> Subject: Wildcard and Literal Searches combined
> 
> 
> Folks,
> My users require wildcard searches. Sometimes their search phrases 
> contain spaces. I am having trouble trying to implement a wildcard 
> search on strings containing spaces, so if the term includes spaces I 
> force a literal search by adding double quotes to the search term.
> So the search string for 'Dublin' becomes search term (Dublin*) 
> whereas search string 'Dublin City' becomes ("Dublin City")
> 
> 
> If I use (Dublin City*) I get all instances of Dublin OR City in the 
> results which is not what I am looking for.
> 
> Is there any way I can combine the wildcard search and the literal?
> 
> Heres my existing code. Its in c# with Lucene.Net
> 
> //if input has spaces we do a literal search if
(sSearchQuery.IndexOf("
> ") < 0) { sSearchQuery = "(" + sSearchQuery + "*)"; } else { 
> sSearchQuery = "(\"" + sSearchQuery + "\")"; } IndexSearcher searcher 
> = new IndexSearcher(sIndexLocation); Hits oHitColl = 
> searcher.Search(oParser.Parse(sSearchQuery));
> 
> Thanks folks
> --
> View this message in context:
> http://www.nabble.com/Wildcard-and-Literal-Searches-combined-tp1808995
> 0p
> 18089950.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 


BiP Solutions Limited is a company registered in Scotland with Company Number SC086146 and VAT number 383030966 and having its registered office at Park House, 300 Glasgow Road, Shawfield, Glasgow, G73 1SQ ****************************************************************************
This e-mail (and any attachment) is intended only for the attention of the addressee(s). Its unauthorised use, disclosure, storage or copying is not permitted. If you are not the intended recipient, please destroyall copies and inform the sender by return e-mail.
This e-mail (whether you are the sender or the recipient) may be monitored, recorded and retained by BiP Solutions Ltd.
E-mail monitoring/ blocking software may be used, and e-mail content may be read at any time. You have a responsibility to ensure laws are not broken when composing or forwarding e-mails and their contents.
****************************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Wildcard and Literal Searches combined

Posted by mick l <ka...@gmail.com>.
Cheers,
Out of interest, how did you go about it in the end?


jloken wrote:
> 
> Hi, 
> 
> I posed a similar question on 09 May 08. 
> 
> The response was as below. I did not go down this route however, as a
> wild carded 'exact' phrase is in a way contradictory. 
> Regards
> Jon
> 
> 
> 
> Hi,
> 
> Here's a searchable mailing list archive: 
> http://www.gossamer-threads.com/lists/lucene/java-user/
> 
> As regards the wildcard phrase queries, here's one way I think you could
> do it, but it's a bit of extra work. If you're using QueryParser, you'd
> have to override the "getFieldQuery" method to use span queries instead
> of phrase queries.
> 
> A phrase query can be implemented as a span query with a span or "slop"
> 
> factor of 1. So, once you have the PhraseQuery object, you would:
> 
> 1. Extract the terms
> 2. For each one, check if it contains a "*" or a "?"
> 3. If it does, create a WildcardQuery using that term, and re-write it
> using IndexReader.rewrite method. This expands the wildcard query into
> all it's matches.
> 4. Create an array of SpanTermQuery objects, (one SpanTermQuery for each
> term that matched you wildcard); then add that array to a SpanOrQuery.
> 5. Repeat 2 to 4 for each wildcard term in the phrase.
> 6. Finally (!), create a SpanNearQuery, adding all the original terms in
> order, but substituting your SpanOrQuerys for the wildcard terms. Use a
> slop of 1, and set the inOrder flag to true.
> 
> So, essentially, you'd end up with: (you'll have to excuse me if I
> haven't rendered the span queries correctly as strings here - but this
> should give the general idea...)
> 
> spanNear[boiler (spanOr[replacement replacing])]
> 
> So it will accept *either* "replacement" or "replacing" adjacent to
> "boiler", which is what you want.
> 
> As you can see, it's a bit of work - but if you add this functionality
> to the QueryParser, you'll can re-use it a lot!
> 
> Hope that helps!
> 
> -JB
> 
> 
> 
> 
>  
> 
> -----Original Message-----
> From: mick l [mailto:kali.mist@gmail.com] 
> Sent: 24 June 2008 13:29
> To: java-user@lucene.apache.org
> Subject: Wildcard and Literal Searches combined
> 
> 
> Folks,
> My users require wildcard searches. Sometimes their search phrases
> contain spaces. I am having trouble trying to implement a wildcard
> search on strings containing spaces, so if the term includes spaces I
> force a literal search by adding double quotes to the search term.
> So the search string for 'Dublin' becomes search term (Dublin*) whereas
> search string 'Dublin City' becomes ("Dublin City")
> 
> 
> If I use (Dublin City*) I get all instances of Dublin OR City in the
> results which is not what I am looking for. 
> 
> Is there any way I can combine the wildcard search and the literal?
> 
> Heres my existing code. Its in c# with Lucene.Net
> 
> //if input has spaces we do a literal search if (sSearchQuery.IndexOf("
> ") < 0) { sSearchQuery = "(" + sSearchQuery + "*)"; } else {
> sSearchQuery = "(\"" + sSearchQuery + "\")"; } IndexSearcher searcher =
> new IndexSearcher(sIndexLocation); Hits oHitColl =
> searcher.Search(oParser.Parse(sSearchQuery));
> 
> Thanks folks
> --
> View this message in context:
> http://www.nabble.com/Wildcard-and-Literal-Searches-combined-tp18089950p
> 18089950.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> ______________________________________________        
> This email has been scanned by Netintelligence        
> http://www.netintelligence.com/email
> 
> 
> 
> BiP Solutions Limited is a company registered in Scotland with Company
> Number SC086146 and VAT number 383030966 and having its registered office
> at Park House, 300 Glasgow Road, Shawfield, Glasgow, G73 1SQ
> ****************************************************************************
> This e-mail (and any attachment) is intended only for the attention of the
> addressee(s). Its unauthorised use, disclosure, storage or copying is not
> permitted. If you are not the intended recipient, please destroyall copies
> and inform the sender by return e-mail.
> This e-mail (whether you are the sender or the recipient) may be
> monitored, recorded and retained by BiP Solutions Ltd.
> E-mail monitoring/ blocking software may be used, and e-mail content may
> be read at any time. You have a responsibility to ensure laws are not
> broken when composing or forwarding e-mails and their contents.
> ****************************************************************************
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Wildcard-and-Literal-Searches-combined-tp18089950p18090884.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


RE: Wildcard and Literal Searches combined

Posted by Jon Loken <Jo...@bipsolutions.com>.
Hi, 

I posed a similar question on 09 May 08. 

The response was as below. I did not go down this route however, as a
wild carded 'exact' phrase is in a way contradictory. 
Regards
Jon



Hi,

Here's a searchable mailing list archive: 
http://www.gossamer-threads.com/lists/lucene/java-user/

As regards the wildcard phrase queries, here's one way I think you could
do it, but it's a bit of extra work. If you're using QueryParser, you'd
have to override the "getFieldQuery" method to use span queries instead
of phrase queries.

A phrase query can be implemented as a span query with a span or "slop"

factor of 1. So, once you have the PhraseQuery object, you would:

1. Extract the terms
2. For each one, check if it contains a "*" or a "?"
3. If it does, create a WildcardQuery using that term, and re-write it
using IndexReader.rewrite method. This expands the wildcard query into
all it's matches.
4. Create an array of SpanTermQuery objects, (one SpanTermQuery for each
term that matched you wildcard); then add that array to a SpanOrQuery.
5. Repeat 2 to 4 for each wildcard term in the phrase.
6. Finally (!), create a SpanNearQuery, adding all the original terms in
order, but substituting your SpanOrQuerys for the wildcard terms. Use a
slop of 1, and set the inOrder flag to true.

So, essentially, you'd end up with: (you'll have to excuse me if I
haven't rendered the span queries correctly as strings here - but this
should give the general idea...)

spanNear[boiler (spanOr[replacement replacing])]

So it will accept *either* "replacement" or "replacing" adjacent to
"boiler", which is what you want.

As you can see, it's a bit of work - but if you add this functionality
to the QueryParser, you'll can re-use it a lot!

Hope that helps!

-JB




 

-----Original Message-----
From: mick l [mailto:kali.mist@gmail.com] 
Sent: 24 June 2008 13:29
To: java-user@lucene.apache.org
Subject: Wildcard and Literal Searches combined


Folks,
My users require wildcard searches. Sometimes their search phrases
contain spaces. I am having trouble trying to implement a wildcard
search on strings containing spaces, so if the term includes spaces I
force a literal search by adding double quotes to the search term.
So the search string for 'Dublin' becomes search term (Dublin*) whereas
search string 'Dublin City' becomes ("Dublin City")


If I use (Dublin City*) I get all instances of Dublin OR City in the
results which is not what I am looking for. 

Is there any way I can combine the wildcard search and the literal?

Heres my existing code. Its in c# with Lucene.Net

//if input has spaces we do a literal search if (sSearchQuery.IndexOf("
") < 0) { sSearchQuery = "(" + sSearchQuery + "*)"; } else {
sSearchQuery = "(\"" + sSearchQuery + "\")"; } IndexSearcher searcher =
new IndexSearcher(sIndexLocation); Hits oHitColl =
searcher.Search(oParser.Parse(sSearchQuery));

Thanks folks
--
View this message in context:
http://www.nabble.com/Wildcard-and-Literal-Searches-combined-tp18089950p
18089950.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


______________________________________________        
This email has been scanned by Netintelligence        
http://www.netintelligence.com/email



BiP Solutions Limited is a company registered in Scotland with Company Number SC086146 and VAT number 383030966 and having its registered office at Park House, 300 Glasgow Road, Shawfield, Glasgow, G73 1SQ ****************************************************************************
This e-mail (and any attachment) is intended only for the attention of the addressee(s). Its unauthorised use, disclosure, storage or copying is not permitted. If you are not the intended recipient, please destroyall copies and inform the sender by return e-mail.
This e-mail (whether you are the sender or the recipient) may be monitored, recorded and retained by BiP Solutions Ltd.
E-mail monitoring/ blocking software may be used, and e-mail content may be read at any time. You have a responsibility to ensure laws are not broken when composing or forwarding e-mails and their contents.
****************************************************************************

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Wildcard and Literal Searches combined

Posted by Chris Hostetter <ho...@fucit.org>.
: My users require wildcard searches. Sometimes their search phrases contain
: spaces. I am having trouble trying to implement a wildcard search on strings
: containing spaces, so if the term includes spaces I force a literal search
: by adding double quotes to the search term.
: So the search string for 'Dublin' becomes search term (Dublin*)
: whereas search string 'Dublin City' becomes ("Dublin City")

: If I use (Dublin City*) I get all instances of Dublin OR City in the results
: which is not what I am looking for. 

my naive understanding of your problem makes me think you should 
completely avoid using the QueryParser.  Index this field as "UNTOKENIZED" 
(or using an Analyzer that does no tokenization, but perhaps lowercases if 
that's what you want) and at query time construct a WildCardQuery directly 
off of the users input (lowercases first if that what you did at index 
time)

...but i could be missunderstanding your goal.


-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Wildcard and Literal Searches combined

Posted by Erick Erickson <er...@gmail.com>.
Do you require that the words be right next to each other? You can,
of course, set your default to AND (it's OR unless you change it
explicitly). That'll give you documents that have both Dublin and City.
You don't need wildcards at all in this case.

If you require exact matches, you can use PhraseQuery or SpanQuery
to get words right next to each other. That is, "dublin city" would
be found but not "dublin is a very large city". You can set slop
(the distance between terms that counts as a hit) variously, and you
can also specify whether order is important. Again, you don't need
wildcards at all (and shouldn't use them unless you really need to).

Beware that phrase queries do NOT go through wildcard expansion. That
is, "Dubli* city" (as a PhraseQuery) doesn't do what you might think.

I'd really look over the query syntax carefully for more insights. See
http://lucene.apache.org/java/docs/queryparsersyntax.html

Also note that SpanQueries are something you construct yourself, you
don't send text through the query parser and magically get SpanQueries
back.

Best
Erick

On Tue, Jun 24, 2008 at 8:28 AM, mick l <ka...@gmail.com> wrote:

>
> Folks,
> My users require wildcard searches. Sometimes their search phrases contain
> spaces. I am having trouble trying to implement a wildcard search on
> strings
> containing spaces, so if the term includes spaces I force a literal search
> by adding double quotes to the search term.
> So the search string for 'Dublin' becomes search term (Dublin*)
> whereas search string 'Dublin City' becomes ("Dublin City")
>
>
> If I use (Dublin City*) I get all instances of Dublin OR City in the
> results
> which is not what I am looking for.
>
> Is there any way I can combine the wildcard search and the literal?
>
> Heres my existing code. Its in c# with Lucene.Net
>
> //if input has spaces we do a literal search
> if (sSearchQuery.IndexOf(" ") < 0)
> {
> sSearchQuery = "(" + sSearchQuery + "*)";
> }
> else
> {
> sSearchQuery = "(\"" + sSearchQuery + "\")";
> }
> IndexSearcher searcher = new IndexSearcher(sIndexLocation);
> Hits oHitColl = searcher.Search(oParser.Parse(sSearchQuery));
>
> Thanks folks
> --
> View this message in context:
> http://www.nabble.com/Wildcard-and-Literal-Searches-combined-tp18089950p18089950.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>