Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/05/02 17:43:05 UTC
Re: [lucy-dev] Simplifying the Query Parser
On Mon, Apr 18, 2011 at 02:16:33PM -0700, David E. Wheeler wrote:
> On Apr 18, 2011, at 12:41 PM, Marvin Humphrey wrote:
> > The others are all solvable by tightening up the parser:
> >
> > * Currently field names must match /[0-9A-Za-z_]+/. We should require
> > them to be identifiers, i.e. they must not start with a number:
> > /[A-Za-z_][0-9A-Za-z_]*/
> > * QueryParser should use single-token lookahead to enforce that field name
> > constructs must be followed by something sensible.
>
> And if it's not, what then?

If it's not, as in the forward slash following the colon in
'http://www.apache.org/', then we consume the whole thing as a leaf --
typically resulting in a phrase query.

> What if it's sensible but the field doesn't exist (or is private)?

For now, we just use the field name. Starting in 0.2.0, I think we should
consider parsing such constructs as NoMatchQueries. But that's a more
involved change.
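
A rough sketch of the difference between the two behaviors, in Python rather than Lucy's own code. TermQuery, NoMatchQuery, resolve() and searchable_fields are hypothetical stand-ins invented for this illustration; the only thing taken from the message is the idea that a field construct naming an unknown or private field would become a query that matches nothing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TermQuery:
    """Hypothetical stand-in for a query restricted to one field."""
    field: str
    term: str


@dataclass(frozen=True)
class NoMatchQuery:
    """Hypothetical stand-in for a query that matches no documents."""


def resolve(field, term, searchable_fields):
    # Proposed behavior (assumption): a field that is unknown or private
    # yields a query that can never match, instead of using the name as-is.
    if field in searchable_fields:
        return TermQuery(field, term)
    return NoMatchQuery()
```

Under these assumptions, resolve("title", "lucy", {"title"}) gives a TermQuery, while resolve("secret", "lucy", {"title"}) gives a NoMatchQuery.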
> > This issue should not block 0.1.0, which is almost done.
>
> Well, if the handling of PHP::Interpreter is a bug, should that not be fixed
> before 0.1.0?

Yes, I agree.

With r1096201 and r1096207, the two changes proposed above have been
implemented. There is no change in behavior unless set_heed_colons() has been
invoked. If heed_colons is true, then the following query strings will now
produce sensible results:

    http://www.apache.org/
    10:30
    PHP::Interpreter

This will still produce an unexpected result:

    mailto:me@example.com
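
The two tightened rules can be sketched roughly as follows. This is a minimal Python illustration, not the code from the commits above: the name split_field and the set of characters treated as "something sensible" after the colon are assumptions made for the example, and a one-character peek stands in for single-token lookahead.

```python
import re

# Field names must be identifiers: they may not start with a digit.
FIELD_NAME = re.compile(r"[A-Za-z_][0-9A-Za-z_]*")

# Assumption for this sketch: "something sensible" after the colon is a
# word character, a double quote, or an opening parenthesis.
SENSIBLE_START = re.compile(r'[0-9A-Za-z_"(]')


def split_field(text):
    """Return (field, rest) for a field construct, else (None, text)."""
    m = FIELD_NAME.match(text)
    if m is None:
        return None, text  # e.g. a leading digit: treat as a leaf
    end = m.end()
    if end >= len(text) or text[end] != ":":
        return None, text  # no colon after the identifier
    # Peek at what follows the colon before committing to a field.
    if end + 1 >= len(text) or not SENSIBLE_START.match(text[end + 1]):
        return None, text  # consume the whole thing as a leaf
    return m.group(0), text[end + 1:]
```

With these assumed rules, 'http://www.apache.org/', '10:30' and 'PHP::Interpreter' all come back as (None, text), i.e. as leaves, while 'mailto:me@example.com' is still split into the field 'mailto' and the rest 'me@example.com', which is the unexpected result noted above.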
Marvin Humphrey
Re: [lucy-dev] Simplifying the Query Parser
Posted by "David E. Wheeler" <da...@kineticode.com>.
On May 2, 2011, at 8:43 AM, Marvin Humphrey wrote:
>>> The others are all solvable by tightening up the parser:
>>>
>>> * Currently field names must match /[0-9A-Za-z_]+/. We should require
>>> them to be identifiers, i.e. they must not start with a number:
>>> /[A-Za-z_][0-9A-Za-z_]*/
>>> * QueryParser should use single-token lookahead to enforce that field name
>>> constructs must be followed by something sensible.
>>
>> And if it's not, what then?
>
> If it's not, as in the forward slash following the colon in
> 'http://www.apache.org/', then we consume the whole thing as a leaf --
> typically resulting in a phrase query.

Great. Wouldn't work for mailto:foo@bar.com, though.

>> What if it's sensible but the field doesn't exist (or is private)?
>
> For now, we just use the field name. Starting in 0.2.0, I think we should
> consider parsing such constructs as NoMatchQueries. But that's a more
> involved change.

What does that mean?

>>> This issue should not block 0.1.0, which is almost done.
>>
>> Well, if the handling of PHP::Interpreter is a bug, should that not be fixed
>> before 0.1.0?
>
> Yes, I agree.
>
> With r1096201 and r1096207, the two changes proposed above have been
> implemented. There is no change in behavior unless set_heed_colons() has been
> invoked. If heed_colons is true, then the following query strings will now
> produce sensible results:
>
>     http://www.apache.org/
>     10:30
>     PHP::Interpreter
>
> This will still produce an unexpected result:
>
>     mailto:me@example.com

Sounds like a great improvement, thanks!

David