Posted to user@lucenenet.apache.org by "Granroth, Neal V." <ne...@thermofisher.com> on 2007/08/31 19:58:21 UTC

RE: using multiple wildcards in a term?

Douglas,

Acceptable performance is a subjective thing.

I am currently running tests with an index of 140005 "documents", and 507027 terms.

A three-field boolean search using a single term finds 12063 hits in 0.047 seconds.

A three-field boolean search using a single wildcard term (*word) finds 923 hits in 0.375 seconds.

That's slower by roughly a factor of 8. Significant, yes, but still much faster than my test UI can display the hits, and fast enough that supporting wildcard queries is a useful thing to do.

Looking at the source (version 1.9.1) for "WildcardQuery" and "WildcardTermEnum", the class it uses to process the query, it does not appear to support multiple asterisk wildcards.

However, you could probably compose a boolean query joining two WildcardQueries to achieve that result.
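
To illustrate why a leading wildcard tends to cost more than a trailing one, here is a minimal Python sketch. It is not Lucene code: it assumes the term dictionary is a sorted list and that the literal text before the first wildcard is used to narrow the scan. The names TERMS, literal_prefix and wildcard_terms are hypothetical, made up for this example.

```python
from bisect import bisect_left
from fnmatch import fnmatchcase

# Hypothetical sorted term dictionary standing in for an index's term list.
TERMS = sorted(["keyboard", "password", "sword", "word", "wordless", "wordy", "worry"])


def literal_prefix(pattern):
    """Return the part of the pattern before the first wildcard character."""
    for i, ch in enumerate(pattern):
        if ch in "*?":
            return pattern[:i]
    return pattern


def wildcard_terms(pattern, terms=TERMS):
    """Return (matching terms, number of terms examined) for one pattern."""
    prefix = literal_prefix(pattern)
    start = bisect_left(terms, prefix)  # seek to the first term >= prefix
    matches, examined = [], 0
    for term in terms[start:]:
        if not term.startswith(prefix):
            break  # sorted order: no later term can share the prefix
        examined += 1
        if fnmatchcase(term, pattern):
            matches.append(term)
    return matches, examined


if __name__ == "__main__":
    print(wildcard_terms("word*"))  # trailing wildcard: scan is narrowed by "word"
    print(wildcard_terms("*word"))  # leading wildcard: empty prefix, every term is examined
```

With an empty literal prefix there is nothing to seek to, so the whole term list is walked. Under the assumptions above, that would be consistent with the slowdown I measured.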


-- Neal


-----Original Message-----
From: Douglas Smith (DataSmithy) [mailto:datasmithy@googlemail.com]
Sent: Friday, August 31, 2007 9:43 AM
To: lucene-net-user@incubator.apache.org
Subject: Re: using multiple wildcards in a term?

Hi Michael,

FYI, with version 2.1, I am using wildcards with the standard query
parser, and it seems to be working the way I expect.  That is, if I put
wildcards at the beginning *or* end of a word (prefix or suffix word
part), I get different result counts compared to a word without any
wildcards.

However, I was not able to get wildcards to work with the WildcardQuery
function searching on a single term (it returned no results).  It is
possible I may not have been using it correctly, since it was my first try.

Also, my index is apparently small enough that I don't get a significant
performance hit from using wildcards at the beginning of a term.

Does anybody know if Lucene supports wildcards at the beginning *and*
end of a term at the same time?  I am getting no results when I do this.

Also from an interface design point of view, if Lucene does not support
this, could it be argued that it should throw an error in this case,
instead of returning no results?

Michael Mitiaguin wrote:
> Douglas,
>
> I never used it, but in the "Lucene in Action" book we may read:
> Wildcards at the beginning of a term are prohibited using QueryParser, but
> an API-coded WildcardQuery may use leading wildcards (at the expense of
> performance).
>
> Regards
> Michael
>
> On 8/31/07, Douglas Smith <do...@aciwebs.com> wrote:
>
>> Hi everyone,
>>
>> Are wildcard queries intended to be able to support wildcards at the
>> beginning *and* end of a term?
>>
>> I am getting search results when I use a single wildcard (*), but not
>> when I use them at the beginning *and* end of a word.  The Lucene Java
>> documentation seems unclear on this point, but one of my requirements is
>> to find word fragments in the middle of words.
>>
>>
>> =====================================
>> Douglas M. Smith
>> =====================================
>> Email: douglas.smith@aciwebs.com
>> Yahoo: datasmithy@yahoo.com
>> Jabber: datasmithy@jabber.parcellsharp.net
>> =====================================
>>
>> "For years there has been a theory that millions of monkeys typing at
>> random on millions of typewriters would reproduce the entire works of
>> Shakespeare. The Internet has proven this theory to be untrue."  -
>> Unknown
>>
>>
>>
>>
>
>

--
======================================
Douglas M. Smith
|--- DataSmithy ---|

email: DataSmithy@gmail.com
work: 540-322-2204
home:  540-381-8939
fax:   866-330-9401
aim: datasmithy
yahoo: datasmithy
skype: datasmitty
jabber: datasmithy@jabber.parcellsharp.net
======================================



Re: using multiple wildcards in a term?

Posted by "Douglas Smith (DataSmithy)" <da...@googlemail.com>.

Hi Neal,

Thanks for the thoughts.

I was planning on doing a boolean search if needed (*myword OR myword*),
but that will still not find word fragments in the middle of words
(for search words that are neither suffixes nor prefixes).  It does not
look like Lucene (or many full text search engines in general) meets that
requirement.  I suppose it is a trade-off of features vs. performance.
I am assuming it is generally too expensive an operation for full text
engines (which generally index very large amounts of text data) to
include as a useful feature.
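
To make the gap concrete, here is a small Python sketch (again not Lucene code) that emulates the (*myword OR myword*) query over a made-up term list and reports which terms containing the fragment it would miss. The names TERMS, match, suffix_or_prefix and missed_infix are hypothetical, chosen only for this illustration.

```python
from fnmatch import fnmatchcase

# Hypothetical index terms; only "swordfish" has "word" strictly inside it.
TERMS = ["password", "swordfish", "word", "wordless", "worry"]


def match(pattern, terms=TERMS):
    """Return the set of terms that match a single wildcard pattern."""
    return {t for t in terms if fnmatchcase(t, pattern)}


def suffix_or_prefix(fragment, terms=TERMS):
    """Emulate the boolean query (*fragment OR fragment*)."""
    return match("*" + fragment, terms) | match(fragment + "*", terms)


def missed_infix(fragment, terms=TERMS):
    """Terms containing the fragment that the OR query does not return."""
    containing = {t for t in terms if fragment in t}
    return containing - suffix_or_prefix(fragment, terms)


if __name__ == "__main__":
    print(sorted(suffix_or_prefix("word")))  # terms ending or starting with "word"
    print(sorted(missed_infix("word")))      # terms with "word" only in the middle
```

In this toy data the OR query returns "password", "word" and "wordless", while "swordfish" is left out, which is exactly the mid-word case I need.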
