Posted to solr-user@lucene.apache.org by Ma...@ibsbe.be on 2007/04/19 08:35:05 UTC

Leading wildcards

hi,

we have been trying to get the leading wildcards to work.

we have been looking around the Solr website, the Lucene website, wikis
and the mailing lists etc ...
but we found a lot of contradictory information.

so we have a few questions:
- is the latest version of Lucene capable of handling leading wildcards?
- is the latest version of Solr capable of handling leading wildcards?
- do we need to make adjustments to the Solr source code?
- if we need to adjust the Solr source, what do we need to change?

thanks in advance !
Maarten

Re: Leading wildcards

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 19, 2007, at 11:37 AM, Michael Kimsal wrote:
> It's not that I don't *want* to contribute, but hardly have enough  
> time to
> get the basics
> done some days.

You can rest assured that all of us here are in that same boat.  :)

And you can also rest assured that the switch you're asking for will be
part of Solr in the near future one way or another.  I just like to  
encourage folks that can hack quick and dirty changes to go a little  
bit further and learn the Solr unit testing framework (currently a  
bit more complex than we can make it, I'm sure) and what it takes to  
get a change from hack all the way into the core codebase with wiki  
documentation.  It's easier than most folks think to go the extra  
bit, and helping folks learn how to fish is part of our jobs as well  
(and so we can sit back and relax while all the young whippersnappers  
implement our wishes just from us mentioning them! :)

	Erik


Re: Leading wildcards

Posted by Michael Kimsal <mg...@gmail.com>.
I'm in the middle of looking into that.  For *you* ;)  it may seem like a
quick thing to do.  For me, who's not an expert at this stuff, it's a
balance between delving in deeply enough to figure out how to do it and
hitting our deadlines.

It's actually on someone else's plate here, but he's backed up with two
other projects here first.

It's not that I don't *want* to contribute, but I hardly have enough time
to get the basics done some days.

On 4/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
>
> On Apr 19, 2007, at 11:04 AM, Michael Kimsal wrote:
> > Perhaps I'm simplifying it a bit.  It would certainly help out our
> > comfort
> > level
> > to have it either be on or configurable by default, rather than
> > having to
> > maintain a
> > 'patched' version (yes, the patch is only one line, but it's the
> > principle
> > of the thing).
> > I suspect this would be the same for others.
>
> And here's where your effort could go the extra mile to help
> _yourself_ out as well as the community... instead of the one-line
> change, make it a few more lines and make it a switch from the
> configuration (like the toggle for AND/OR default operator) and even
> better round it out with a test case.  Submit it, lobby for it to be
> reviewed and applied, and step 3... profit!  :)
>
>         Erik
>
>


-- 
Michael Kimsal
http://webdevradio.com

Re: Leading wildcards

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 19, 2007, at 11:04 AM, Michael Kimsal wrote:
> Perhaps I'm simplifying it a bit.  It would certainly help out our  
> comfort
> level
> to have it either be on or configurable by default, rather than  
> having to
> maintain a
> 'patched' version (yes, the patch is only one line, but it's the  
> principle
> of the thing).
> I suspect this would be the same for others.

And here's where your effort could go the extra mile to help  
_yourself_ out as well as the community... instead of the one-line  
change, make it a few more lines and make it a switch from the  
configuration (like the toggle for AND/OR default operator) and even  
better round it out with a test case.  Submit it, lobby for it to be  
reviewed and applied, and step 3... profit!  :)
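
A minimal sketch of what such a switch could look like, using only the JDK. The property name `queryParser.allowLeadingWildcard` and the class below are hypothetical and are not actual Solr settings or code; the point is only that the flag defaults to "off" and that a rejected term produces an explicit error rather than an empty result.

```java
import java.util.Properties;

class LeadingWildcardSwitch {
    // Hypothetical property name, chosen for illustration only.
    static final String KEY = "queryParser.allowLeadingWildcard";

    /** Reads the switch; a missing or non-"true" value means "disabled". */
    static boolean isAllowed(Properties config) {
        return Boolean.parseBoolean(config.getProperty(KEY, "false"));
    }

    /** True if a single query term starts with a wildcard character. */
    static boolean hasLeadingWildcard(String term) {
        return !term.isEmpty() && (term.charAt(0) == '*' || term.charAt(0) == '?');
    }

    /** Rejects leading-wildcard terms unless the switch is turned on. */
    static void check(String term, Properties config) {
        if (hasLeadingWildcard(term) && !isAllowed(config)) {
            throw new IllegalArgumentException(
                "Leading wildcards are disabled: " + term);
        }
    }

    public static void main(String[] args) {
        Properties config = new Properties();
        try {
            check("*foobar", config);          // switch absent, so rejected
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
        config.setProperty(KEY, "true");
        check("*foobar", config);              // no exception once enabled
        System.out.println("accepted after enabling the switch");
    }
}
```

A test case for a real patch would cover the same three situations: switch absent, switch off, switch on.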

	Erik


Re: Leading wildcards

Posted by Walter Underwood <wu...@netflix.com>.
Here is a late response, apache.org was rejecting our e-mails...

Allowing leading wildcards opens up a denial of service attack. It becomes
trivial to overload the search engine and take it out of service, just
hammer it with leading wildcard queries. Please leave the default as
disabled. If we add a config option, there should be a security warning
with it.

wunder

On 4/19/07 8:04 AM, "Michael Kimsal" <mg...@gmail.com> wrote:

> It still seems like it's only something that would be invoked by a user's
> query.
> 
> If I queried for *foobar and leading wildcards were not on in the server,
> I'd get back nothing, which isn't really correct.  I'd think the application
> should
> tell the user that that syntax isn't supported.
> 
> Perhaps I'm simplifying it a bit.  It would certainly help out our comfort
> level
> to have it either be on or configurable by default, rather than having to
> maintain a
> 'patched' version (yes, the patch is only one line, but it's the principle
> of the thing).
> I suspect this would be the same for others.
> 
> 
> 
> On 4/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>> 
>> 
>> On Apr 19, 2007, at 10:39 AM, Yonik Seeley wrote:
>>> On 4/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>>>>> parser.setAllowLeadingWildcards(true);
>>>> 
>>>> I have also run into this issue and have intended to fix up Solr to
>>>> allow configuring that switch on QueryParser.
>>> 
>>> Any reason that parser.setAllowLeadingWildcards(true) shouldn't be
>>> the default?
>> 
>> That's fine by me.  But...
>> 
>>> Does it really need to be configurable?
>> 
>> It all depends on how bad of a hit it'd take on Solr.   What's the
>> breaking point where the performance of full-term scanning (and
>> subsequently faceting, of course) kills over or dies?   FuzzyQuery's
>> die on my 3.7M index and not-super-beefy hardware and system setup.
>> 
>>         Erik
>> 
>> 
> 


Re: Leading wildcards

Posted by Michael Kimsal <mg...@gmail.com>.
It still seems like it's only something that would be invoked by a user's
query.

If I queried for *foobar and leading wildcards were not on in the server,
I'd get back nothing, which isn't really correct.  I'd think the
application should tell the user that that syntax isn't supported.

Perhaps I'm simplifying it a bit.  It would certainly help out our comfort
level to have it either be on by default or configurable, rather than
having to maintain a 'patched' version (yes, the patch is only one line,
but it's the principle of the thing).
I suspect this would be the same for others.



On 4/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
>
> On Apr 19, 2007, at 10:39 AM, Yonik Seeley wrote:
> > On 4/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
> >>> parser.setAllowLeadingWildcards(true);
> >>
> >> I have also run into this issue and have intended to fix up Solr to
> >> allow configuring that switch on QueryParser.
> >
> > Any reason that parser.setAllowLeadingWildcards(true) shouldn't be
> > the default?
>
> That's fine by me.  But...
>
> > Does it really need to be configurable?
>
> It all depends on how bad of a hit it'd take on Solr.   What's the
> breaking point where the performance of full-term scanning (and
> subsequently faceting, of course) kills over or dies?   FuzzyQuery's
> die on my 3.7M index and not-super-beefy hardware and system setup.
>
>         Erik
>
>


-- 
Michael Kimsal
http://webdevradio.com

Re: Leading wildcards

Posted by Chris Hostetter <ho...@fucit.org>.
: > ConstantScorePrefixQuery is used... there shouldn't be an issue with
: > memory, just time.
:
: Oops, except we aren't always talking about a prefix query.
: I know at least some expanding queries automatically limit to the max
: number of boolean clauses.  Not sure if all of them do though.

right ... we're talking about Wildcard queries with leading wildcards ...
i can't remember if it uses maxBooleanClauses or not either ... but even
if it does, supporting this behavior (and running this risk) should be
explicitly controlled by the user (just like changing maxBooleanClauses --
if they don't set it in solrconfig.xml, we use whatever the Lucene default
is)



-Hoss


Re: Leading wildcards

Posted by Yonik Seeley <yo...@apache.org>.
On 4/19/07, Yonik Seeley <yo...@apache.org> wrote:
> > from a stability standpoint, i would suggest that people should have to go
> > out of their way to get this behavior, since it does open up the
> > possibility of a query OOMing Solr extremely easily.
>
> ConstantScorePrefixQuery is used... there shouldn't be an issue with
> memory, just time.

Oops, except we aren't always talking about a prefix query.
I know at least some expanding queries automatically limit to the max
number of boolean clauses.  Not sure if all of them do though.
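
To make the idea of "expanding queries limited by a max number of clauses" concrete, here is a small JDK-only sketch. It is not Lucene code: `expand`, `toRegex` and the exception used are made up for illustration, and the real limit in Lucene may be enforced differently. The sketch walks a term list, collects every term matching the wildcard pattern, and gives up once the cap is exceeded.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class CappedWildcardExpansion {
    /** Converts a wildcard pattern (* = any run, ? = one char) to a regex. */
    static Pattern toRegex(String wildcard) {
        StringBuilder sb = new StringBuilder();
        for (char c : wildcard.toCharArray()) {
            if (c == '*') sb.append(".*");
            else if (c == '?') sb.append('.');
            else sb.append(Pattern.quote(String.valueOf(c)));
        }
        return Pattern.compile(sb.toString(), Pattern.DOTALL);
    }

    /** Expands a wildcard into matching terms, failing past maxClauses. */
    static List<String> expand(String wildcard, Iterable<String> terms, int maxClauses) {
        Pattern p = toRegex(wildcard);
        List<String> matches = new ArrayList<>();
        for (String t : terms) {
            if (p.matcher(t).matches()) {
                if (matches.size() == maxClauses) {
                    throw new IllegalStateException(
                        "too many clauses: limit is " + maxClauses);
                }
                matches.add(t);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> terms = List.of("dance", "elegance", "legal", "cat");
        System.out.println(expand("*ance", terms, 10)); // [dance, elegance]
        try {
            expand("*a*", terms, 2);                    // four matches, cap of two
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

The cap bounds memory (the number of collected clauses) but not time: every term is still visited until the limit is hit.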

-Yonik

Re: Leading wildcards

Posted by Chris Hostetter <ho...@fucit.org>.
: > i don't know that this is really adding a feature ... it's changing syntax.
: > "foo:*bar" has meaning by default in the query parser ... its meaning may
: > typically result in a query that doesn't match anything,
:
: I think it's adding syntax, not changing it.
: Right now, you get an exception for foo:*bar
: So if we allowed it by default, I would call that 100% backward compatible.

Ah ... this is where my poor knowledge of wildcards comes in ... i thought
it treated * as a literal. yeah, i see your point now ... i retract that
objection, but reserve the right to stand behind the "let's not make it
easy to crash the box by default" objection.  :)


-Hoss


Re: Leading wildcards

Posted by Yonik Seeley <yo...@apache.org>.
On 4/19/07, Chris Hostetter <ho...@fucit.org> wrote:
> : For things that return results, yes.  I think that taking away
> : features isn't a good thing, but adding them can be (basically,
> : backward compatibility).
>
> i don't know that this is really adding a feature ... it's changing syntax.
> "foo:*bar" has meaning by default in the query parser ... its meaning may
> typically result in a query that doesn't match anything,

I think it's adding syntax, not changing it.
Right now, you get an exception for foo:*bar
So if we allowed it by default, I would call that 100% backward compatible.

The one issue you brought up is memory, and that should be
investigated.   I agree we don't want to make it too easy to blow
things up.

-Yonik

Re: Leading wildcards

Posted by Chris Hostetter <ho...@fucit.org>.
: For things that return results, yes.  I think that taking away
: features isn't a good thing, but adding them can be (basically,
: backward compatibility).

i don't know that this is really adding a feature ... it's changing syntax.
"foo:*bar" has meaning by default in the query parser ... its meaning may
typically result in a query that doesn't match anything, but that's an
expectation people may have based on past use of QueryParser (or reading
of its docs)

on this point, i'm just saying we shouldn't change any default meaning of
syntax ... adding _val_:"func(foo)" didn't really run any risk of doing
something people didn't expect (unless they have a field named _val_) ...
but people who are used to QueryParser protecting them from foolish users
that type in leading wildcards would be in for a nasty surprise if we
change the default.



-Hoss


Re: Leading wildcards

Posted by Yonik Seeley <yo...@apache.org>.
On 4/19/07, Chris Hostetter <ho...@fucit.org> wrote:
>
> : > Any reason that parser.setAllowLeadingWildcards(true) shouldn't be
> : > the default?
>
> i'm of two minds on this, both of which vote "don't do it"
>
> from a predictability standpoint, i think we should keep the default
> behavior the same as the base QueryParser's default behavior as much as
> possible

For things that return results, yes.  I think that taking away
features isn't a good thing, but adding them can be (basically,
backward compatibility).

> from a stability standpoint, i would suggest that people should have to go
> out of their way to get this behavior, since it does open up the
> possibility of a query OOMing Solr extremely easily.

ConstantScorePrefixQuery is used... there shouldn't be an issue with
memory, just time.

> In general: if we are going to change the behavior of existing syntax in
> QP, it should be in ways that make the system more stable (ala:
> ConstantScore Range and Prefix queries) and not less.

One could argue producing a result rather than throwing an exception
is an improvement.

-Yonik

Re: Leading wildcards

Posted by Chris Hostetter <ho...@fucit.org>.
: > Any reason that parser.setAllowLeadingWildcards(true) shouldn't be
: > the default?

i'm of two minds on this, both of which vote "don't do it"

from a predictability standpoint, i think we should keep the default
behavior the same as the base QueryParser's default behavior as much as
possible

from a stability standpoint, i would suggest that people should have to go
out of their way to get this behavior, since it does open up the
possibility of a query OOMing Solr extremely easily.

In general: if we are going to change the behavior of existing syntax in
QP, it should be in ways that make the system more stable (ala:
ConstantScore Range and Prefix queries) and not less.



-Hoss


Re: Leading wildcards

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 19, 2007, at 10:39 AM, Yonik Seeley wrote:
> On 4/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>>> parser.setAllowLeadingWildcards(true);
>>
>> I have also run into this issue and have intended to fix up Solr to
>> allow configuring that switch on QueryParser.
>
> Any reason that parser.setAllowLeadingWildcards(true) shouldn't be  
> the default?

That's fine by me.  But...

> Does it really need to be configurable?

It all depends on how bad of a hit it'd take on Solr.   What's the
breaking point where the performance of full-term scanning (and
subsequently faceting, of course) keels over or dies?   FuzzyQuery
searches die on my 3.7M index and not-super-beefy hardware and system setup.

	Erik


Re: Leading wildcards

Posted by Yonik Seeley <yo...@apache.org>.
On 4/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>>parser.setAllowLeadingWildcards(true);
>
> I have also run into this issue and have intended to fix up Solr to
> allow configuring that switch on QueryParser.

Any reason that parser.setAllowLeadingWildcards(true) shouldn't be the default?
Does it really need to be configurable?

-Yonik

Re: Leading wildcards

Posted by Michael Kimsal <mg...@gmail.com>.
Agreed, but in our tests (100M index) it wasn't a performance hit, and much
better (as in it actually worked) than MSSQL  ;)



On 4/19/07, Erik Hatcher <er...@ehatchersolutions.com> wrote:
>
>
> On Apr 19, 2007, at 6:56 AM, Michael Kimsal wrote:
> > It's bugged us a little bit, because it's something that we need
> > (to be able to emulate the previous foo LIKE '%bar%' SQL behaviour
> > we're
> > replacing), but can't offer our users yet.
>
> I have also run into this issue and have intended to fix up Solr to
> allow configuring that switch on QueryParser.  I'll eventually get to
> this, but someone supply a patch with a test case would get it done
> sooner.
>
> I must, however, caveat discussion of leading wildcards with the
> underlying effect you get.  If you use standard analysis and perform
> a leading wildcard query, you incur a (possibly) dramatic hit in
> terms of performance.  Lucene has to scan *every* term in the
> specified field.  In fact, with my 3.7M index, a fuzzy query for the
> very same reason, kills the query.  There is also a switch on fuzzy
> query that needs to be configurable through Solr, to adjust the
> number of leading characters that are fixed to avoid this all term
> scanning.
>
> There are techniques that can be used to improve the performance of
> in-string types of queries like this, at the expense of indexing time
> and size and clever query creation.   One such technique I've used
> successfully is term rotation enumeration (cat => cat$, at$c, t$ca).
> This involves custom analyzers and query creation.
>
> Once Solr supports this switch, you may find performance fine with
> leading wildcard queries, but at least be forewarned that there are
> scalability skeletons in this closet.
>
>         Erik
>
>


-- 
Michael Kimsal
http://webdevradio.com

Re: Leading wildcards

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Apr 19, 2007, at 6:56 AM, Michael Kimsal wrote:
> It's bugged us a little bit, because it's something that we need
> (to be able to emulate the previous foo LIKE '%bar%' SQL behaviour  
> we're
> replacing), but can't offer our users yet.

I have also run into this issue and have intended to fix up Solr to  
allow configuring that switch on QueryParser.  I'll eventually get to
this, but someone supplying a patch with a test case would get it done
sooner.

I must, however, caveat discussion of leading wildcards with the  
underlying effect you get.  If you use standard analysis and perform  
a leading wildcard query, you incur a (possibly) dramatic hit in  
terms of performance.  Lucene has to scan *every* term in the  
specified field.  In fact, with my 3.7M index, a fuzzy query dies for the
very same reason.  There is also a switch on fuzzy
query that needs to be configurable through Solr, to adjust the
number of leading characters that are fixed to avoid this all-term
scanning.
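
A rough illustration of why a fixed leading part matters, using a sorted `TreeSet` as a stand-in for a field's term dictionary (an assumption made for the sketch; Lucene's actual term enumeration is more involved). With a known prefix the lookup can jump to the first candidate and stop at the first non-match; with a leading wildcard there is nothing to seek to, so every term has to be visited.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

class TermScanSketch {
    /** bar* : seek to the prefix, then stop at the first term that no longer matches. */
    static List<String> prefixMatches(NavigableSet<String> terms, String prefix) {
        List<String> out = new ArrayList<>();
        for (String t : terms.tailSet(prefix, true)) {
            if (!t.startsWith(prefix)) break;
            out.add(t);
        }
        return out;
    }

    /** *bar : no usable starting point, so the whole term set is scanned. */
    static List<String> suffixMatches(NavigableSet<String> terms, String suffix) {
        List<String> out = new ArrayList<>();
        for (String t : terms) {
            if (t.endsWith(suffix)) out.add(t);
        }
        return out;
    }

    public static void main(String[] args) {
        NavigableSet<String> terms =
            new TreeSet<>(List.of("bar", "barn", "foobar", "zebra"));
        System.out.println(prefixMatches(terms, "bar")); // [bar, barn]
        System.out.println(suffixMatches(terms, "bar")); // [bar, foobar]
    }
}
```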

There are techniques that can be used to improve the performance of  
in-string types of queries like this, at the expense of indexing time  
and size and clever query creation.   One such technique I've used
successfully is term rotation enumeration (cat => cat$, at$c, t$ca).
This involves custom analyzers and query creation.
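
The rotation idea can be sketched with plain JDK collections. This is only an illustration of the technique as described above, not the analyzer or query code actually used: the class and method names are invented, `$` is assumed never to occur inside a term, and a sorted set stands in for the index. Each term is stored under all of its rotations, so an in-string search for `*lega*` becomes a cheap prefix lookup for `lega` over the rotated keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.NavigableSet;
import java.util.SortedSet;
import java.util.TreeSet;

class TermRotationSketch {
    static final char END = '$'; // end-of-term marker, assumed absent from terms

    /** cat => [cat$, at$c, t$ca] */
    static List<String> rotations(String term) {
        String s = term + END;
        List<String> out = new ArrayList<>();
        for (int i = 0; i < term.length(); i++) {
            out.add(s.substring(i) + s.substring(0, i));
        }
        return out;
    }

    /** Recovers the original term from one of its rotations. */
    static String unrotate(String rotated) {
        int end = rotated.indexOf(END);
        return rotated.substring(end + 1) + rotated.substring(0, end);
    }

    /** Builds the sorted set of rotated keys for all terms. */
    static NavigableSet<String> index(Iterable<String> terms) {
        NavigableSet<String> keys = new TreeSet<>();
        for (String t : terms) keys.addAll(rotations(t));
        return keys;
    }

    /** Terms containing infix (i.e. *infix*), found by a prefix scan. */
    static SortedSet<String> containing(NavigableSet<String> keys, String infix) {
        SortedSet<String> hits = new TreeSet<>();
        for (String key : keys.tailSet(infix, true)) {
            if (!key.startsWith(infix)) break;
            hits.add(unrotate(key));
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(rotations("cat"));        // [cat$, at$c, t$ca]
        NavigableSet<String> keys = index(List.of("elegance", "legal", "cat"));
        System.out.println(containing(keys, "lega")); // [elegance, legal]
    }
}
```

The trade-off is the one mentioned above: an n-character term produces n index entries, so the index grows and indexing slows down in exchange for avoiding the full term scan. A pure leading wildcard such as `*bar` would map to a prefix lookup for `bar$` in the same structure.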

Once Solr supports this switch, you may find performance fine with  
leading wildcard queries, but at least be forewarned that there are  
scalability skeletons in this closet.

	Erik


Re: Leading wildcards

Posted by Michael Kimsal <mg...@gmail.com>.
I've investigated this recently, and it looks like the latest lucene dev
supposedly supports leading/trailing at the same time.  However, I couldn't
get the latest dev solr to build with the latest dev lucene (as of two weeks
ago).  A lucene mailing list seemed to indicate that lucene as of the last
official build supports both leading/trailing at the same time, but it then
seemed to indicate that it was an 'in development branch only' state still.
I can't find that thread, but that's my understanding of the current
situation.  It's bugged us a little bit, because it's something that we need
(to be able to emulate the previous foo LIKE '%bar%' SQL behaviour we're
replacing), but can't offer our users yet.

On 4/19/07, Burkamp, Christian <C....@ceyoniq.com> wrote:
>
> Hi there,
>
> Solr does not support leading wildcards, because it uses Lucene's standard
> QueryParser class without changing the defaults. You can easily change this
> by inserting the line
>
> parser.setAllowLeadingWildcards(true);
>
> in QueryParsing.java line 92. (This is after creating a QueryParser
> instance in QueryParsing.parseQuery(...))
>
> and it obviously means that you have to change solr's source code. It
> would be nice to have an option in the schema to switch leading wildcards on
> or off per field. Leading wildcards really make no sense on richly populated
> fields because queries tend to result in too many clauses exceptions most of
> the time.
>
> This works for leading wildcards. Unfortunately it does not enable
> searches with leading AND trailing wildcards. E.g. searching for "*lega*"
> does not find results even if the term "elegance" is in the index. If you
> put a second asterisk at the end, the term "elegance" is found (search for
> "*lega**" to get hits).
> Can anybody explain this? It seems to be more of a Lucene
> QueryParser issue.
>
> -- Christian
>
> -----Original Message-----
> From: Maarten.De.Vilder@ibsbe.be [mailto:Maarten.De.Vilder@ibsbe.be]
> Sent: Thursday, April 19, 2007 08:35
> To: solr-user@lucene.apache.org
> Subject: Leading wildcards
>
>
> hi,
>
> we have been trying to get the leading wildcards to work.
>
> we have been looking around the Solr website, the Lucene website, wikis
> and the mailing lists etc ...
> but we found a lot of contradictory information.
>
> so we have a few questions:
> - is the latest version of Lucene capable of handling leading wildcards?
> - is the latest version of Solr capable of handling leading wildcards?
> - do we need to make adjustments to the Solr source code?
> - if we need to adjust the Solr source, what do we need to change?
>
> thanks in advance !
> Maarten
>
>


-- 
Michael Kimsal
http://webdevradio.com

Re: Leading wildcards

Posted by "Burkamp, Christian" <C....@Ceyoniq.com>.
Hi there,

Solr does not support leading wildcards, because it uses Lucene's standard QueryParser class without changing the defaults. You can easily change this by inserting the line

parser.setAllowLeadingWildcards(true);

in QueryParsing.java line 92. (This is after creating a QueryParser instance in QueryParsing.parseQuery(...))

and it obviously means that you have to change Solr's source code. It would be nice to have an option in the schema to switch leading wildcards on or off per field. Leading wildcards really make no sense on richly populated fields because queries tend to result in "too many clauses" exceptions most of the time.

This works for leading wildcards. Unfortunately it does not enable searches with leading AND trailing wildcards. E.g. searching for "*lega*" does not find results even if the term "elegance" is in the index. If you put a second asterisk at the end, the term "elegance" is found (search for "*lega**" to get hits).
Can anybody explain this? It seems to be more of a Lucene QueryParser issue.

-- Christian

-----Original Message-----
From: Maarten.De.Vilder@ibsbe.be [mailto:Maarten.De.Vilder@ibsbe.be]
Sent: Thursday, April 19, 2007 08:35
To: solr-user@lucene.apache.org
Subject: Leading wildcards


hi,

we have been trying to get the leading wildcards to work.

we have been looking around the Solr website, the Lucene website, wikis
and the mailing lists etc ...
but we found a lot of contradictory information.

so we have a few questions:
- is the latest version of Lucene capable of handling leading wildcards?
- is the latest version of Solr capable of handling leading wildcards?
- do we need to make adjustments to the Solr source code?
- if we need to adjust the Solr source, what do we need to change?

thanks in advance !
Maarten