Posted to java-user@lucene.apache.org by Ross Simpson <si...@gmail.com> on 2013/05/22 02:08:32 UTC

Query with phrases, wildcards and fuzziness

Hi all,

I'm trying to create a fairly complex query, and having trouble constructing it.

My index contains a TextField with place names as strings, e.g.:
	Port Melbourne, VIC 3207

I'm using an analyzer with just KeywordTokenizer and LowerCaseFilter, so that my strings are not tokenized at all.

I want to support end-user searches like the following, and have them match that string above:
	Port Melbourne, VIC 3207 (exact)
	Port (prefix)
	Port Mel (prefix, including a space)
	Melbo (wildcard)
	Melburne (fuzzy)

I'm trying to get away with not parsing the query myself, and just constructing something like this:
	parser.parse( "(STRING^9) OR (STRING*^7) OR (*STRING*^5) OR (STRING~1^3)" );

That doesn't seem to work with either QueryParser or ComplexPhraseQueryParser.  Specifically, I'm having trouble getting appropriate results when there's a space in the input string, notably with the wildcard match part (it ends up returning everything in the index).

Is my approach above possible?  I have also looked at using specific Query implementations and combining them in a BooleanQuery, but I'm not quite sure how to replicate the "OR" behavior I want (from what I've read, Occur.SHOULD is not equivalent to "OR").


Any suggestions would be appreciated.

Thanks!
Ross




---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Query with phrases, wildcards and fuzziness

Posted by Jack Krupansky <ja...@basetechnology.com>.
Using BooleanQuery with Occur.SHOULD clauses is the way to go. There are some 
nuances, but you may not run into them. Often the issue is the query parser 
syntax rather than the Lucene BooleanQuery (BQ) itself. For example, a string 
of ANDs and ORs is parsed into a single flat BQ, which is clearly not 
traditional "Boolean" logic, but if you code multiple BQs that nest (or fully 
parenthesize your source query), you will get a true "Boolean" query. Whether 
you need "true" Boolean semantics depends on your particular application.

-- Jack Krupansky
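
The nuance described above can be illustrated with a toy model. This is a sketch in plain Java, not Lucene code: BoolModel, Clause, term, must, should and bool are hypothetical names invented for the illustration. It assumes two rules: a BooleanQuery made only of SHOULD clauses matches when at least one clause matches (which is plain "OR"), and once a MUST clause is present the SHOULD clauses no longer decide matching. It also assumes that the classic parser turns "a AND b OR c" into the flat form +a +b c.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.function.Predicate;

// Toy model of the BooleanQuery matching rules. Not Lucene code:
// every name in this class is hypothetical.
class BoolModel {
    enum Occur { MUST, SHOULD }

    /** Pairs a matcher over a document's set of terms with an Occur flag. */
    static final class Clause {
        final Predicate<Set<String>> matcher;
        final Occur occur;

        Clause(Predicate<Set<String>> matcher, Occur occur) {
            this.matcher = matcher;
            this.occur = occur;
        }
    }

    static Clause must(Predicate<Set<String>> m) { return new Clause(m, Occur.MUST); }

    static Clause should(Predicate<Set<String>> m) { return new Clause(m, Occur.SHOULD); }

    /** Matches documents that contain the given term. */
    static Predicate<Set<String>> term(String t) {
        return doc -> doc.contains(t);
    }

    /**
     * Every MUST clause has to match. If there are no MUST clauses,
     * at least one SHOULD clause has to match. An empty query matches nothing.
     */
    static Predicate<Set<String>> bool(Clause... clauses) {
        return doc -> {
            boolean hasMust = false;
            boolean anyShould = false;
            for (Clause c : clauses) {
                boolean hit = c.matcher.test(doc);
                if (c.occur == Occur.MUST) {
                    hasMust = true;
                    if (!hit) {
                        return false;
                    }
                } else if (hit) {
                    anyShould = true;
                }
            }
            return hasMust || anyShould;
        };
    }

    public static void main(String[] args) {
        Set<String> onlyC = new HashSet<>(Arrays.asList("c"));

        // Flat form assumed for "a AND b OR c": +a +b c
        Predicate<Set<String>> flat =
                bool(must(term("a")), must(term("b")), should(term("c")));
        // Fully parenthesized form: (a AND b) OR c
        Predicate<Set<String>> nested =
                bool(should(bool(must(term("a")), must(term("b")))), should(term("c")));
        // Only SHOULD clauses: behaves like a OR c
        Predicate<Set<String>> pureOr = bool(should(term("a")), should(term("c")));

        System.out.println("flat:   " + flat.test(onlyC));   // false
        System.out.println("nested: " + nested.test(onlyC)); // true
        System.out.println("pureOr: " + pureOr.test(onlyC)); // true
    }
}
```

Under these assumptions, a BooleanQuery holding the exact, prefix, wildcard and fuzzy queries as SHOULD clauses, with no MUST clause alongside them, would behave like the "OR" asked about in this thread.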

-----Original Message----- 
From: Ross Simpson
Sent: Wednesday, May 22, 2013 7:44 AM
To: java-user@lucene.apache.org
Subject: Re: Query with phrases, wildcards and fuzziness

One further question:

If I wanted to construct my query using Query implementations instead of
a QueryParser (e.g. TermQuery, WildcardQuery, etc.), what's the right
way to duplicate the "OR" functionality I wrote about below?  As I
mentioned, I've read that wrapping query objects in a BooleanQuery and
using Occur.SHOULD is not necessarily the same.

Any suggestions?

Ross


On 22/05/2013 11:46 AM, Ross Simpson wrote:
> Jack, thanks very much!  I wasn't considering a space a special character 
> for some reason.  That has worked perfectly.
>
> Cheers,
> Ross
>
>
> On May 22, 2013, at 10:24 AM, Jack Krupansky wrote:
>
>> Just escape embedded spaces with a backslash.
>>
>> -- Jack Krupansky


Re: Query with phrases, wildcards and fuzziness

Posted by Ross Simpson <si...@gmail.com>.
One further question:

If I wanted to construct my query using Query implementations instead of 
a QueryParser (e.g. TermQuery, WildcardQuery, etc.), what's the right 
way to duplicate the "OR" functionality I wrote about below?  As I 
mentioned, I've read that wrapping query objects in a BooleanQuery and 
using Occur.SHOULD is not necessarily the same.

Any suggestions?

Ross


Re: Query with phrases, wildcards and fuzziness

Posted by Ross Simpson <si...@gmail.com>.
Jack, thanks very much!  I wasn't considering a space a special character for some reason.  That has worked perfectly.

Cheers,
Ross


Re: Query with phrases, wildcards and fuzziness

Posted by Jack Krupansky <ja...@basetechnology.com>.
Just escape embedded spaces with a backslash.

-- Jack Krupansky
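
As a rough sketch of the advice above, the helper below backslash-escapes whitespace in the user's input before substituting it for STRING in the query template from the original question. It is plain Java with no Lucene dependency. PlaceQueryStrings, escapeTerm and buildQuery are hypothetical names, and the set of special characters is an assumption about the classic query parser syntax rather than something confirmed in this thread.

```java
// Hypothetical helper: builds the query string from the original question,
// escaping embedded spaces (and other assumed special characters) with a backslash.
class PlaceQueryStrings {
    // Assumed set of characters with special meaning in the classic parser syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes whitespace and assumed special characters with a backslash. */
    static String escapeTerm(String raw) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < raw.length(); i++) {
            char c = raw.charAt(i);
            if (Character.isWhitespace(c) || SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Fills the exact, prefix, wildcard and fuzzy template with the escaped input. */
    static String buildQuery(String userInput) {
        String s = escapeTerm(userInput.trim());
        return "(" + s + "^9) OR (" + s + "*^7) OR (*" + s + "*^5) OR (" + s + "~1^3)";
    }

    public static void main(String[] args) {
        // prints: (Port\ Mel^9) OR (Port\ Mel*^7) OR (*Port\ Mel*^5) OR (Port\ Mel~1^3)
        System.out.println(buildQuery("Port Mel"));
    }
}
```

The resulting string would then be passed to parser.parse(...). Depending on how the parser is configured, the leading wildcard in the third clause may need to be enabled explicitly.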
