Posted to dev@lucy.apache.org by Marvin Humphrey <ma...@rectangular.com> on 2011/05/02 17:43:05 UTC

Re: [lucy-dev] Simplifying the Query Parser

On Mon, Apr 18, 2011 at 02:16:33PM -0700, David E. Wheeler wrote:
> On Apr 18, 2011, at 12:41 PM, Marvin Humphrey wrote:

> > The others are all solvable by tightening up the parser:
> > 
> >  * Currently field names must match /[0-9A-Za-z_]+/.  We should require
> >    them to be identifiers, i.e. they must not start with a number:
> >    /[A-Za-z_][0-9A-Za-z_]*/
> >  * QueryParser should use single-token lookahead to enforce that field name
> >    constructs must be followed by something sensible.
> 
> And if it's not, what then?

If it's not, as in the forward slash following the colon in
'http://www.apache.org/', then we consume the whole thing as a leaf --
typically resulting in a phrase query.

> What if it's sensible but the field doesn't exist (or is private)?

For now, we just use the field name.  Starting in 0.2.0, I think we should
consider parsing such constructs as NoMatchQueries.  But that's a more
involved change.

> > This issue should not block 0.1.0, which is almost done.  
> 
> Well, if the handling of PHP::Interpreter is a bug, should that not be fixed
> before 0.1.0?

Yes, I agree.

With r1096201 and r1096207, the two changes proposed above have been
implemented.  There is no change in behavior unless set_heed_colons() has been
invoked.  If heed_colons is true, then the following query strings will now
produce sensible results:

    http://www.apache.org/
    10:30
    PHP::Interpreter

This will still produce an unexpected result:

    mailto:me@example.com
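
For readers following along, here is a rough Python sketch of the two rules
discussed above (identifier-only field names plus single-token lookahead
after the colon). It is not the Lucy implementation: split_field() and
SENSIBLE_NEXT are hypothetical names, and the set of characters treated as
"something sensible" is an assumption, since the thread does not spell it out.

```python
import re

# Field names must be identifiers, i.e. they must not start with a digit.
FIELD = re.compile(r"[A-Za-z_][0-9A-Za-z_]*")

# ASSUMPTION: "something sensible" after the colon means a word character,
# a double quote, or an open paren. The thread does not define this set.
SENSIBLE_NEXT = re.compile(r'[0-9A-Za-z_"(]')


def split_field(token):
    """Return (field, rest) if token opens with a field construct.

    Otherwise return (None, token), meaning the whole token should be
    consumed as a single leaf.
    """
    m = FIELD.match(token)
    # No identifier at the start, or no colon right after it: not a field.
    if m is None or token[m.end():m.end() + 1] != ":":
        return None, token
    rest = token[m.end() + 1:]
    # Single-token lookahead: the text after the colon must look sensible.
    if SENSIBLE_NEXT.match(rest) is None:
        return None, token
    return m.group(0), rest
```

Under these assumptions, 'http://www.apache.org/', '10:30', and
'PHP::Interpreter' all come back whole as leaves, while
'mailto:me@example.com' is still split into field 'mailto' and term
'me@example.com', because 'mailto' is a valid identifier and 'm' passes
the lookahead.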

Marvin Humphrey


Re: [lucy-dev] Simplifying the Query Parser

Posted by "David E. Wheeler" <da...@kineticode.com>.
On May 2, 2011, at 8:43 AM, Marvin Humphrey wrote:

>>> The others are all solvable by tightening up the parser:
>>> 
>>> * Currently field names must match /[0-9A-Za-z_]+/.  We should require
>>>   them to be identifiers, i.e. they must not start with a number:
>>>   /[A-Za-z_][0-9A-Za-z_]*/
>>> * QueryParser should use single-token lookahead to enforce that field name
>>>   constructs must be followed by something sensible.
>> 
>> And if it's not, what then?
> 
> If it's not, as in the forward slash following the colon in
> 'http://www.apache.org/', then we consume the whole thing as a leaf --
> typically resulting in a phrase query.

Great. Wouldn't work for mailto:foo@bar.com, though.

>> What if it's sensible but the field doesn't exist (or is private)?
> 
> For now, we just use the field name.  Starting in 0.2.0, I think we should
> consider parsing such constructs as NoMatchQueries.  But that's a more
> involved change.

What does that mean?

>>> This issue should not block 0.1.0, which is almost done.  
>> 
>> Well, if the handling of PHP::Interpreter is a bug, should that not be fixed
>> before 0.1.0?
> 
> Yes, I agree.
> 
> With r1096201 and r1096207, the two changes proposed above have been
> implemented.  There is no change in behavior unless set_heed_colons() has been
> invoked.  If heed_colons is true, then the following query strings will now
> produce sensible results:
> 
>    http://www.apache.org/
>    10:30
>    PHP::Interpreter
> 
> This will still produce an unexpected result:
> 
>    mailto:me@example.com

Sounds like a great improvement, thanks!

David