Posted to dev@lucene.apache.org by Peter Carlson <ca...@bookandhammer.com> on 2002/06/02 00:40:10 UTC

Bug? QueryParser may not correctly interpret RangeQuery text

I am trying to get date range searching to use the range query (maybe a bad
choice vs. DateFilter, but I wanted to be able to use it from the query
string).

So I type in a string like
date:[0czi1ceuk-0d0ouet2k]

When I run this through the QueryParser it returns only one term. That is, the
query gets converted to:

date:[0czi1ceuk-0d0ouet2k-null]


This is because the StandardTokenizer sees <alphanum> <p> <has_digit> as a
single token.

Note: <p> can be ".", "-", "_", ",", and a few other things.

What do people think is the right way to handle this issue for range
queries? My suggestion is to do an indexOf() for "-" and create the one or
two tokens. That is, don't use the analyzer to determine what the tokens are
here. Is there a problem with this?
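
A minimal sketch of this suggestion, assuming the range text has already had
its surrounding brackets removed. The class and method names are hypothetical
and not part of Lucene.

```java
/** Hypothetical helper (not a Lucene class): splits raw range text without an analyzer. */
class RangeTextSplitter {

    /**
     * Splits "lower-upper" at the first '-' and returns one or two endpoints.
     * Assumes the enclosing [ ] have already been stripped.
     */
    static String[] split(String rangeText) {
        int dash = rangeText.indexOf('-');
        if (dash < 0) {
            // No separator found: the whole text is a single endpoint.
            return new String[] { rangeText };
        }
        String lower = rangeText.substring(0, dash);
        String upper = rangeText.substring(dash + 1);
        return new String[] { lower, upper };
    }

    public static void main(String[] args) {
        for (String endpoint : split("0czi1ceuk-0d0ouet2k")) {
            System.out.println(endpoint);
        }
    }
}
```

Splitting at the first "-" only works while the endpoints themselves never
contain a "-", which holds for the encoded date strings above but not for
human-readable dates such as nnnn-nn-nn.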

--Peter


--
To unsubscribe, e-mail:   <ma...@jakarta.apache.org>
For additional commands, e-mail: <ma...@jakarta.apache.org>


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
>It's true that the unsophisticated end-user would not
>use SQL, but between range (inclusive, exclusive),
>boolean, fuzzy, etc., the simple query parser you have
>is evolving into something more complex than SQL.

Which is a reasonable argument that range queries are outside the scope of 
what the query parser is supposed to do.

>While SQL supports them with key words, we are getting
>into an endless quest for unused characters to mark
>the latest variation of the query.

I wasn't too happy about having added ranges in the first place for exactly 
this reason.  The query parser is supposed to be a convenience, a 90/10 
(actually, more like a 99/1) solution (one which handles 90% of the queries 
with 10% of the work.)  Pushing for that last 10% at the expense of the 
first 90% is a bad tradeoff IMO.  The raw query classes still work fine for 
that last 10% (or 1%).

>By the way, it
>seems that you already have support for the "WHERE
>..." part (AND, OR, NOT, NEAR). If we had "LIKE" and
>"BETWEEN ... AND ..." we would have almost everything
>SQL has for the matching part.

Two responses to this:

1.  Wrong.  We don't have NEAR at all, and AND, OR, and NOT are simple 
operators which give hints to the BooleanQuery class; they don't impart 
structure to the query.  They are, in fact, a convenience for expressing
   (+a +b)
as
   a AND b
mostly because mainstream search engines support AND and OR.

2.  The argument that "we already have half of it, let's go all the way" is 
a siren song.  In lots of cases, this is basically equivalent to "two 
wrongs make a right."  As in "we already violated the XYZ principle for 
some purpose, so there's no point in letting principle get in the way of 
further 'progress'".

In this particular case, it's not quite as bad as that, but it's taking us in 
a dangerous direction.  The query parser is not a structured language for 
free text queries -- if it were, it should be designed from the ground up to 
be so.  In cramming in too many features, it would be easy for it to lose 
its most valuable feature -- simplicity.  We may have already done that, 
but there's no point in pushing further just in case there's any doubt.

>I think that the only way to have a query that does
>NOT look like a programming language is to have
>natural language understanding (which we won't have
>for a while.) Once the end user is forced to learn the
>difference between terms and operators, he already is
>in the realm of programming languages.

This is a strong argument for backing out some of the features already 
added so far, but I'm sure that's not what you're suggesting (although 
maybe you should be.)

But I think this argument is basically hogwash.  Don't forget we're arguing 
about features which will be used by less than 1% of the user base, and 
probably less than 100th or maybe 1000th of 1% of all queries entered 
through the query parser.

Right now, we have several ways of building queries:
  - a simple query parser, which can handle the basics (terms, phrases, 
field search, slop, wildcards);
  - a flexible and powerful set of query classes with which developers can 
build arbitrary queries;
  - we can combine the above, letting the user enter query terms and 
produce a Query, and then combine that with other query terms based on 
input in a user interface (such as a date picker.)

Now, if you want to design a new query language, one which is actually 
designed for its intended purpose (rather than having features accreted 
every time someone feels that XYZ query structure is critical enough to go 
in the query parser), be my guest -- I'll help, I'll even write the parser 
for you.  We can call it the AdvancedQueryParser or whatever you want to 
call it, and I won't throw stones at your design.

But I'm going to vigorously -1 any proposal for the query parser that 
makes the Joe Users out there pay for features that are only of interest to 
Joe Gooroo.

Nobody has convinced me at all that the existing query parser is inadequate 
for its intended purpose.


--
Brian Goetz
Quiotix Corporation
brian@quiotix.com           Tel: 650-843-1300            Fax: 650-324-8032

http://www.quiotix.com




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
>I agree with Alex -- the QueryParser's syntax has become
>at least as obfuscated as SQL, probably more so.  It may
>have started out simple, but it's not simple anymore.

But is that an argument to let it go "all the way" to complete obfuscation, 
or simplify it so that it is more in line with its intended purpose?

Is the fact that something has gotten out of hand an excuse to let it get 
further out of hand?


--
Brian Goetz
Quiotix Corporation
brian@quiotix.com           Tel: 650-843-1300            Fax: 650-324-8032

http://www.quiotix.com




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
I think that it is fairly intuitive depending on what functionality people
are trying to use.

For just a standard set of terms, the big question is what the default
boolean operator should be: OR or AND.

For the rest, it seems fairly "standard" when looking at other search engine
syntax. Also, when a user wants the added functionality, it's available just
by reading some documentation: searching by field, phrase queries, sub
queries, fuzzy searches, boosting terms, proximity operators.

From my perspective selling a product that depends on the search engine,
people ask about the complex search features, but often don't use them
(although they would downgrade the product unless it had them). It's very
nice to be able to say we have the functionality, and it's one of the
reasons why we chose Lucene: it was so full-featured from the get
go.

--Peter



On 6/2/02 9:40 PM, "Eric D. Friedman" <er...@conveysoftware.com> wrote:

> I agree with Alex -- the QueryParser's syntax has become
> at least as obfuscated as SQL, probably more so.  It may
> have started out simple, but it's not simple anymore.
> 
> Eric




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by "Eric D. Friedman" <er...@conveysoftware.com>.
I agree with Alex -- the QueryParser's syntax has become
at least as obfuscated as SQL, probably more so.  It may
have started out simple, but it's not simple anymore.

Eric

On Sun, 2 Jun 2002, Alex Murzaku wrote:

> It's true that the unsophisticated end-user would not
> use SQL, but between range (inclusive, exclusive),
> boolean, fuzzy, etc., the simple query parser you have
> is evolving into something more complex than SQL.
> While SQL supports them with key words, we are getting
> into an endless quest for unused characters to mark
> the latest variation of the query. By the way, it
> seems that you already have support for the "WHERE
> ..." part (AND, OR, NOT, NEAR). If we had "LIKE" and
> "BETWEEN ... AND ..." we would have almost everything
> SQL has for the matching part.
>
> I think that the only way to have a query that does
> NOT look like a programming language is to have
> natural language understanding (which we won't have
> for a while.) Once the end user is forced to learn the
> difference between terms and operators, he already is
> in the realm of programming languages.
>
>
> --- Brian Goetz <br...@quiotix.com> wrote:
> > > Maybe we could even throw in full support for SQL like
> > > SELECT, WHERE, etc. As far as I remember, JavaCC used
> > > to have an SQL parser as well, so, I assume we would
> > > just need the translation to a Lucene query. I am sure
> > > everybody would appreciate using some syntax with
> > > which they are already familiar.
> >
> > But the query parser is targeted at end users, whose level of "query
> > sophistication" is searching on Yahoo and eBay, not writing SQL.  The
> > query parser language should NOT look like a programming language.
> >
> > Don't forget who the audience is!




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Alex Murzaku <mu...@yahoo.com>.
It's true that the unsophisticated end-user would not
use SQL, but between range (inclusive, exclusive),
boolean, fuzzy, etc., the simple query parser you have
is evolving into something more complex than SQL.
While SQL supports them with key words, we are getting
into an endless quest for unused characters to mark
the latest variation of the query. By the way, it
seems that you already have support for the "WHERE
..." part (AND, OR, NOT, NEAR). If we had "LIKE" and
"BETWEEN ... AND ..." we would have almost everything
SQL has for the matching part.

I think that the only way to have a query that does
NOT look like a programming language is to have
natural language understanding (which we won't have
for a while.) Once the end user is forced to learn the
difference between terms and operators, he already is
in the realm of programming languages.


--- Brian Goetz <br...@quiotix.com> wrote:
> > Maybe we could even throw in full support for SQL like
> > SELECT, WHERE, etc. As far as I remember, JavaCC used
> > to have an SQL parser as well, so, I assume we would
> > just need the translation to a Lucene query. I am sure
> > everybody would appreciate using some syntax with
> > which they are already familiar.
> 
> But the query parser is targeted at end users, whose level of "query
> sophistication" is searching on Yahoo and eBay, not writing SQL.  The
> query parser language should NOT look like a programming language.
> 
> Don't forget who the audience is!  


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> Maybe we could even throw in full support for SQL like
> SELECT, WHERE, etc. As far as I remember, JavaCC used
> to have an SQL parser as well, so, I assume we would
> just need the translation to a Lucene query. I am sure
> everybody would appreciate using some syntax with
> which they are already familiar.

But the query parser is targeted at end users, whose level of "query
sophistication" is searching on Yahoo and eBay, not writing SQL.  The
query parser language should NOT look like a programming language.

Don't forget who the audience is!  



Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Alex Murzaku <mu...@yahoo.com>.
How about just the SQL "BETWEEN ... AND ..."?

Maybe we could even throw in full support for SQL like
SELECT, WHERE, etc. As far as I remember, JavaCC used
to have an SQL parser as well, so, I assume we would
just need the translation to a Lucene query. I am sure
everybody would appreciate using some syntax with
which they are already familiar.

--- Peter Carlson <ca...@bookandhammer.com> wrote:
> I like this idea of [GOOP:GOOP] as it gives the most flexibility. However,
> this requires the field to have a known characteristic like a date field,
> number field or text field, correct? If you just use the static Field.Date,
> this would require adding a new attribute to the Field class? I like this idea
> but I don't know the difficulty / backward compatibility issues.
> 
> If the extra field attribute is too difficult, then I suggest we use the
> nnnn-nn-nn format method so we can use the pattern to determine the data
> type.
> 
> For number fields, should this support only integers, or decimal numbers
> too?
> 
> I don't think we should use the : character, because we probably want to
> support time formats in the date format. Something like 03/01/2001 at
> 00:01:00. Maybe something like ">" or "|" or even "->" ?
> 
> Also, inclusive vs. exclusive should be accounted for with the [ vs {
> characters.  I think this might already be done, but just wanted to throw it
> out there.
> 
> --Peter
> 
> 
> On 6/2/02 2:13 AM, "Brian Goetz" <br...@quiotix.com> wrote:
> 
> >>> How about:
> >>> 
> >>>  DATE = nnnn-nn-nn
> >>>  NUMBER = n*
> >>>  RANGE = [ DATE : DATE ] | [ NUMBER : NUMBER ]
> >>> 
> >>> An alternate, less parse-oriented approach would be this:
> >>>   RANGE = [ GOOP : GOOP ]
> >>> where
> >>>   GOOP = any string of letters/numbers not containing : or ].
> >> 
> >> I'd go for the first one as it's more explicit.  However, perhaps the
> >> second approach is more extensible?
> > 
> > When I first did the query parser, I defined terms by inclusion
> > (stating valid characters) instead of exclusion (excluding non-term
> > characters.)  Turns out I missed quite a few in the first go around,
> > which taught me the lesson (again) that sometimes trying to be too
> > specific is a rat's nest.  What about dates like 02-Mai-2002 (not a
> > typo, French for May)?  Letting DateFormat figure it out has some
> > merit.
> > 
> >> DateField(Date) and NumberField(int) sounds right, but wouldn't Field
> >> class make more sense?
> > 
> > I had in mind static methods of Field, just like Field.Text --
> > Field.Date, Field.Number.   Sorry if that wasn't clear.  This seems
> > an easy addition.


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> Should date fields be tokenized? Currently the Field.Keyword(String, Date)
> constructor has the Date tokenized, but this doesn't seem right.

No, I thought I copied from Keyword so it wouldn't be tokenized.  Did
I screw it up?



Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Doug Cutting <cu...@lucene.com>.
Brian Goetz wrote:
> The only part I'm not sure about here is what to do with negative
> numbers.  I'm sure there's some representation which gives the desired
> result when sorted lexicographically; Doug?  

Two's complement comes to mind.  Convert to two's complement and print 
as hex, padded to 32 bits, or 64 if you want to be able to intermix ints 
and longs.  Floats work too: convert the pre-decimal part to two's 
complement and pad.
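
A minimal sketch of the padded-hex idea for ints. One caveat: plain two's
complement hex places negative values (which start with 8 through f) after
positive ones when compared as strings, so this sketch also flips the sign
bit before printing. That adjustment is an assumption added here, not
something stated above, and the class and method names are hypothetical.

```java
/** Hypothetical helper (not a Lucene class): ints as fixed-width, sortable hex. */
class SortableInt {

    /** Encodes an int as 8 hex digits whose String order matches numeric order. */
    static String encode(int value) {
        // Flipping the sign bit maps Integer.MIN_VALUE to 00000000
        // and Integer.MAX_VALUE to ffffffff.
        return String.format("%08x", value ^ 0x80000000);
    }

    /** Reverses encode(): parses the hex digits and flips the sign bit back. */
    static int decode(String text) {
        return ((int) Long.parseLong(text, 16)) ^ 0x80000000;
    }

    public static void main(String[] args) {
        int[] samples = { -5, -1, 0, 3 };
        for (int sample : samples) {
            System.out.println(sample + " -> " + encode(sample));
        }
    }
}
```

The same approach extends to longs with 16 hex digits, which is what would
be needed to intermix ints and longs in one field.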

Doug




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> > Also, we talked about adding numeric ranges too.  First we'd need a
> > NumberField class and "constructor", similar to DateField, and then
> > we'd want to use NumberFormat to see if the elements in the range query
> > are of the right format.

The only part I'm not sure about here is what to do with negative
numbers.  I'm sure there's some representation which gives the desired
result when sorted lexicographically; Doug?  

> > I'd like to hand off the specifics of date format parsing to someone
> > else, since I'm pretty pressed for time; I've done the part that
> > involves the parser, which is the high-risk part.

Something that handled a variety of date formats would be nice too.
The standard DateFormat handles nn/nn/nn[nn], but not nnnn-nn-nn or
nn.nn.nnnn or any of the others.  I don't want to get carried away
here with a pluggable date parser -- I just think we should choose
some reasonable representation and call it good.

But we're still stuck on date formats with time -- right now, we're
using whitespace as delimiters in the range -- which means that dates
like 
  2002-01-01 1:35 PM
are no good unless enclosed in quotes.  
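
A minimal sketch of how quoted endpoints can coexist with whitespace
delimiters. It assumes the " TO " keyword proposed elsewhere in this thread
as the range separator; the class and method names are hypothetical, and this
is not the JavaCC grammar that was checked in.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not a Lucene class): splits the text between [ and ]. */
class RangeBodyParser {

    /** Splits on whitespace, keeping double-quoted runs together (quotes are dropped). */
    static List<String> tokens(String body) {
        List<String> out = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        boolean started = false;
        for (char c : body.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
                started = true; // an empty pair of quotes still counts as a token
            } else if (Character.isWhitespace(c) && !inQuotes) {
                if (started) {
                    out.add(current.toString());
                    current.setLength(0);
                    started = false;
                }
            } else {
                current.append(c);
                started = true;
            }
        }
        if (started) {
            out.add(current.toString());
        }
        return out;
    }

    /** Returns {lower, upper}; rejects anything that is not "lower TO upper". */
    static String[] endpoints(String body) {
        List<String> t = tokens(body);
        if (t.size() != 3 || !t.get(1).equals("TO")) {
            throw new IllegalArgumentException("expected: lower TO upper, got: " + body);
        }
        return new String[] { t.get(0), t.get(2) };
    }

    public static void main(String[] args) {
        String[] range = endpoints("\"2002-01-01 1:35 PM\" TO 2002-06-24");
        System.out.println(range[0]);
        System.out.println(range[1]);
    }
}
```

Without the quotes, the date-time above splits into three tokens and the
range is rejected, which is the limitation described here.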




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
I am finally getting a few moments to look at this.

Should date fields be tokenized? Currently the Field.Keyword(String, Date)
constructor has the Date tokenized, but this doesn't seem right.

--Peter


On 6/24/02 5:09 PM, "Brian Goetz" <br...@quiotix.com> wrote:

>> Just curious what the status of this issue is, as the discussion seems
>> to have stopped.
> 
> I checked in a first cut at this facility:
> 
> - Field.Keyword(String, Date) "constructor", which uses DateField format;
> - Extensions to query parser as discussed earlier;
> - test cases.
> 
> More work needs to be done on the date parsing code, but it's pretty
> much separated from the query parsing now.  Right now, we accept dates
> of the form "nn/nn/nn", as supported by
> DateFormat.getDateInstance(DateFormat.SHORT).  I think we want to be
> more flexible about this, and also support date-times as well (the new
> parser stuff includes support for quoted strings inside range expressions,
> so we can include things with spaces that way.)
> 
> Also, we talked about adding numeric ranges too.  First we'd need a
> NumberField class and "constructor", similar to DateField, and then
> we'd want to use NumberFormat to see if the elements in the range query
> are of the right format.
> 
> I'd like to hand off the specifics of date format parsing to someone
> else, since I'm pretty pressed for time; I've done the part that
> involves the parser, which is the high-risk part.


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
Brian,

This sounds great. I'll have a look at it and hopefully have some time to
help finish it up. Although I am a bit behind on a few Lucene projects
myself.

--Peter



On 6/24/02 5:09 PM, "Brian Goetz" <br...@quiotix.com> wrote:

>> Just curious what the status of this issue is, as the discussion seems
>> to have stopped.
> 
> I checked in a first cut at this facility:
> 
> - Field.Keyword(String, Date) "constructor", which uses DateField format;
> - Extensions to query parser as discussed earlier;
> - test cases.
> 
> More work needs to be done on the date parsing code, but it's pretty
> much separated from the query parsing now.  Right now, we accept dates
> of the form "nn/nn/nn", as supported by
> DateFormat.getDateInstance(DateFormat.SHORT).  I think we want to be
> more flexible about this, and also support date-times as well (the new
> parser stuff includes support for quoted strings inside range expressions,
> so we can include things with spaces that way.)
> 
> Also, we talked about adding numeric ranges too.  First we'd need a
> NumberField class and "constructor", similar to DateField, and then
> we'd want to use NumberFormat to see if the elements in the range query
> are of the right format.
> 
> I'd like to hand off the specifics of date format parsing to someone
> else, since I'm pretty pressed for time; I've done the part that
> involves the parser, which is the high-risk part.


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> Just curious what the status of this issue is, as the discussion seems
> to have stopped.

I checked in a first cut at this facility: 

  - Field.Keyword(String, Date) "constructor", which uses DateField format;
  - Extensions to query parser as discussed earlier; 
  - test cases.

More work needs to be done on the date parsing code, but it's pretty
much separated from the query parsing now.  Right now, we accept dates
of the form "nn/nn/nn", as supported by
DateFormat.getDateInstance(DateFormat.SHORT).  I think we want to be
more flexible about this, and also support date-times as well (the new 
parser stuff includes support for quoted strings inside range expressions,
so we can include things with spaces that way.)  

Also, we talked about adding numeric ranges too.  First we'd need a
NumberField class and "constructor", similar to DateField, and then
we'd want to use NumberFormat to see if the elements in the range query
are of the right format.

I'd like to hand off the specifics of date format parsing to someone
else, since I'm pretty pressed for time; I've done the part that
involves the parser, which is the high-risk part.  




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
Awesome.

--Peter

On 6/6/02 6:05 AM, "Brian Goetz" <br...@quiotix.com> wrote:

>> I 100% agree with you about having actual field types, I just don't know
>> what the ramifications are and haven't looked. I was hoping Doug would chime
>> in and make it easy for us.
> 
> Doug's suggestion seemed OK to me as a first and easy cut at this.
> 
> Sounds like we've got agreement on TO as a range separator.
> 
> Sounds like we've got agreement on converting goop to token values.
> 
> I'll work on it this week.


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> I 100% agree with you about having actual field types, I just don't know
> what the ramifications are and haven't looked. I was hoping Doug would chime
> in and make it easy for us.

Doug's suggestion seemed OK to me as a first and easy cut at this.  

Sounds like we've got agreement on TO as a range separator.  

Sounds like we've got agreement on converting goop to token values.  

I'll work on it this week.




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> The first part of this is easy: just add new Field "constructor" methods 
> that take Date and number parameters, e.g.:
>    Field.Keyword(String name, Date value);
>    Field.Keyword(String name, int value);

By the way, is there any NumberField class, which provides String <-->
int support, in the same way that the DateField class provides String
<--> Date support?




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Doug Cutting <cu...@lucene.com>.
Brian Goetz wrote:
> I still want to see Date and Number fields supported as basic types in
> the Field class, rather than "use a String in this magic date format".

The first part of this is easy: just add new Field "constructor" methods 
that take Date and number parameters, e.g.:
   Field.Keyword(String name, Date value);
   Field.Keyword(String name, int value);

(By the way, the reason I capitalized these static methods originally 
was because I thought of them as constructors, which are capitalized, 
but in retrospect it was probably a bad idea.  It was 1997 and this was 
my first Java program...)

The question is, what happens next?  The simple approach would be to 
convert these values into strings that sort lexicographically the same 
as the dates and numbers, as is done by DateField.  But how then do you 
get the Date and number values back?

The easy approach would be to add Document methods that returned the 
value of a field as a Date or number, e.g.:
   Date Document.getDate(String name);
   int Document.getInt(String name);
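
To illustrate the simple approach, here is a minimal sketch that turns a Date
into a fixed-width string that sorts in chronological order, and back again.
It is written in the spirit of what DateField does, but the width, the class
name and the method names are assumptions for illustration rather than
Lucene's actual code, and it only handles dates on or after the 1970 epoch.

```java
import java.util.Date;

/** Hypothetical helper (not a Lucene class): Date <--> sortable String. */
class SortableDate {

    // Wide enough for Long.MAX_VALUE written in base 36 (13 digits).
    private static final int WIDTH = 13;

    /** Encodes the timestamp as zero-padded base 36, so String order equals time order. */
    static String encode(Date date) {
        long millis = date.getTime();
        if (millis < 0) {
            throw new IllegalArgumentException("dates before 1970 are not handled in this sketch");
        }
        String digits = Long.toString(millis, Character.MAX_RADIX);
        StringBuilder padded = new StringBuilder();
        for (int i = digits.length(); i < WIDTH; i++) {
            padded.append('0');
        }
        return padded.append(digits).toString();
    }

    /** Recovers the Date from a string produced by encode(). */
    static Date decode(String text) {
        return new Date(Long.parseLong(text, Character.MAX_RADIX));
    }

    public static void main(String[] args) {
        Date now = new Date();
        String stored = encode(now);
        System.out.println(stored + " decodes to the same instant: " + decode(stored).equals(now));
    }
}
```

A typed accessor such as the getDate method proposed above would then amount
to decoding the stored string on the way out.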

The downside of this approach is that typing is weak.  To fix that would 
require a revamp of the way that Lucene stores fields.  In the general 
case, we might want to move to using object serialization for field 
values, however I suspect this would be much slower than the existing 
implementation, and document de-serialization is already a performance 
bottleneck.

A serialization-based approach might replace the Keyword method with:
   Field.Keyword(String name, Object value);
and the Document.get method with:
   Object Document.get(String name);

Are you okay with the easy approach?  Does someone want to explore using 
serialization for field values?

Doug




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
Brian,

I 100% agree with you about having actual field types, I just don't know
what the ramifications are and haven't looked. I was hoping Doug would chime
in and make it easy for us.

It would be nice to have it working though. I vote +1 for " TO "

And the "try to format as a date, then as a number" methodology seems great to
me. If we want to support full-on terms, we could also say that if it doesn't
just have digits then it's just a term. So Date -> Number -> Term.
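
The Date -> Number -> Term fallback could look roughly like the sketch below.
The yyyy-MM-dd pattern is an assumption (no date format has been settled on),
and the class, enum and method names are hypothetical.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

/** Hypothetical helper (not a Lucene class): decides how to treat a range endpoint. */
class EndpointClassifier {

    enum Kind { DATE, NUMBER, TERM }

    /** Tries a date first, then a number, and otherwise falls back to a plain term. */
    static Kind classify(String text) {
        if (parseDate(text) != null) {
            return Kind.DATE;
        }
        if (isNumber(text)) {
            return Kind.NUMBER;
        }
        return Kind.TERM;
    }

    /** Returns the date only if the whole string is a valid yyyy-MM-dd date, else null. */
    static Date parseDate(String text) {
        SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd", Locale.ROOT);
        format.setLenient(false); // reject impossible dates such as month 13
        ParsePosition position = new ParsePosition(0);
        Date date = format.parse(text, position);
        boolean consumedAll = position.getIndex() == text.length();
        return (date != null && consumedAll) ? date : null;
    }

    static boolean isNumber(String text) {
        try {
            Long.parseLong(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String sample : new String[] { "2002-06-04", "42", "lucene" }) {
            System.out.println(sample + " -> " + classify(sample));
        }
    }
}
```

Checking the parse position matters because DateFormat otherwise accepts a
string that merely starts with a valid date.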

--Peter


On 6/5/02 12:20 AM, "Brian Goetz" <br...@quiotix.com> wrote:

>> I guess from my perspective we are at
>> 
>> field:[<goop>-><goop>]
>> 
>> The delimiter is not yet defined, but the options currently discussed are
>> -
>> ->
>> ;
>> :
>> |
>>> 
>> 
>> The problem with - and : is that they may be part of a date format.
> 
> How about " TO " as the delimiter?
> 
>> The action taken by the QueryParser would depend on the type of field we
>> were using (if that were an easy change). For Date fields, it would convert
>> the <goop> to a Date using the SimpleDateFormat and try to guess the format
>> (I think it will handle the ISO 8601 formats).
> 
> Or, what about this: try to parse it as a date as above, using a
> DateFormat.  If that fails, try to parse it as a number.
> 
> I still want to see Date and Number fields supported as basic types in
> the Field class, rather than "use a String in this magic date format".


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> I guess from my perspective we are at
> 
> field:[<goop>-><goop>]
> 
> The delimiter is not yet defined, but the options currently discussed are
> -
> ->
> ;
> :
> |
> >
> 
> The problem with - and : is that they may be part of a date format.

How about " TO " as the delimiter?

> The action taken by the QueryParser would depend on the type of field we
> were using (if that were an easy change). For Date fields, it would convert
> the <goop> to a Date using the SimpleDateFormat and try to guess the format
> (I think it will handle the ISO 8601 formats).

Or, what about this: try to parse it as a date as above, using a
DateFormat.  If that fails, try to parse it as a number.  

I still want to see Date and Number fields supported as basic types in
the Field class, rather than "use a String in this magic date format".




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
I guess from my perspective we are at

field:[<goop>-><goop>]

The delimiter is not yet defined, but the options currently discussed are
-
->
;
:
|
>

The problem with - and : is that they may be part of a date format.

The action taken by the QueryParser would depend on the type of field we
were using (if that were an easy change). For Date fields, it would convert
the <goop> to a Date using the SimpleDateFormat and try to guess the format
(I think it will handle the ISO 8601 formats).

OR

If adding a type to a field is difficult, then the next option is to just
support a date range and assume the data is a date.

OR

If adding a type to a field is difficult and we don't want to just support a
Date format, then we would create a specific format like
YYYY/MM/DDTHH:MM:SS
For dates and just a set of digits for numbers.


Does that sound about right? If so, what are people's preferences?


My preferences are 
Solve with Option 3 now, but determine how to solve with option 1.

Delimiter preference would be ">". It seems intuitive to me.

--Peter



On 6/4/02 10:17 PM, "Otis Gospodnetic" <ot...@yahoo.com> wrote:

> Hello,
> 
> Just curious what the status of this issue is, as the discussion seems
> to have stopped.
> 




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

Just curious what the status of this issue is, as the discussion seems
to have stopped.

--- "Eric D. Friedman" <er...@conveysoftware.com> wrote:
> Instead of reinventing the wheel for representing dates, how about
> using an existing standard?  ISO 8601 defines a simple lexical
> representation for dates, times (with optional millisecond precision),
> and timezones that is easy to implement.  This is what's used in the
> XML Schema "dateTime" datatype.
> 
> A summary of the ISO 8601 notation is available here:
> http://www.cl.cam.ac.uk/~mgk25/iso-time.html
> 
> The documentation for the XML Schema dateTime datatype is here:
> http://www.w3.org/TR/xmlschema-2/#dateTime

I agree, that is why I immediately suggested YYYY-MM-DD.  I dislike
U.S.-centric or Europe-centric approaches when there is a standard
format.

> I whipped up a JavaCC parser to handle this lexical representation (see
> attachment).
> 
> Note that for this to be useful in QueryParser, it's going to need its
> own lexical state.  This makes sense anyway, since it would be a
> mistake to have the query syntax infer magical properties about strings
> that appear to be dates.  Better is to have a keyword in the query
> syntax that introduces a date value:  something like date(<VALUE>)
> would work.  So would to_date(<VALUE>) for those who know SQL. I would
> have suggested date:<VALUE> but I think that already means something in
> the QueryParser's lexical specification. (I don't actually use
> QueryParser because the patches I've submitted previously haven't made
> it in yet, and until they do, QP is fatally crippled for my purposes).

I'll try to look for your patches in the archives (if you have the URL
handy, please send it to me), so that I can put them on the TODO list, if
it makes sense to do so.
As for the above comments about the parser, I'm afraid I'm still a
JavaCC neophyte. I don't dislike the date(<VALUE>) approach.  If users can
grasp field:value they shouldn't have a problem with field:date(value),
I think.

Otis


> PARSER_BEGIN(ISO8601Parser)
> 
> import java.io.*;
> import java.util.*;
> import java.text.*;
> 
> public class ISO8601Parser {
> 
>   static DateFormat fmt;
> 
>   public static void main(String args[]) throws ParseException {
>     String date;
> 
>     //date = "1999-05-31T13:20:00Z";
>     //date = "1999-05-31T13:20:00-00:01";
>     date = "1999-05-31T13:20:00.999-08:00";
> 
>     TimeZone utc = TimeZone.getTimeZone("UTC");
>     fmt = DateFormat.getDateTimeInstance();
>     fmt.setTimeZone(utc);
> 
>     ISO8601Parser parser = new ISO8601Parser(new StringReader(date));
>     Date d = parser.date();
>     System.out.println(fmt.format(d));
>   }
> }
> 
> PARSER_END(ISO8601Parser)
> 
> TOKEN :
> {
>   <#DIGIT: ["0"-"9"]>
> | <TWOD: <DIGIT><DIGIT>>         // two digits used for day, month, hours, minutes, seconds
> | <MILLIS: <TWOD><DIGIT>>        // millisecond precision is 000 .. 999
> | <YEAR: <TWOD><TWOD>(<DIGIT>)*> // at least 4 digits, but possibly more
> | <DASH: "-">                    // delimiter for CCYY-MM-DD; doubles as minus sign for signed ints
> | <COLON: ":">                   // delimiter for hh:mm:ss
> | <DOT: ".">                     // delimiter for ss.mmm (milliseconds)
> | <T: "T" >                      // delimiter between date and time
> | <Z: "Z" >                      // UTC timezone
> | <PLUS: "+">                    // indicates positive offset from UTC
> }
> 
> /**
>  * Input to this production is a series of tokens matching the following specification:
>  * CCYY-MM-DD -- a date with no time specification<br>
>  * CCYY-MM-DDThh:mm:ss -- a timestamp implicitly in the UTC timezone<br>
>  * CCYY-MM-DDThh:mm:ssZ -- a timestamp explicitly in the UTC timezone<br>
>  * CCYY-MM-DDThh:mm:ss-08:00 -- a timestamp with a negative 8 hour offset from UTC<br>
>  * CCYY-MM-DDThh:mm:ss.mmm -- a timestamp with millisecond precision<br>
>  * -CCYY-MM-DD -- a date whose year is before the common era (BCE)<br>
>  * NNCCYY-MM-DD -- a date whose year is > 9999<br>
>  *
>  * <p> Note that years greater than 9999 are allowed, but that 0000 is not a valid year.
>  * Negative numbers are allowed when representing years BCE.
>  * </p>
>  *
>  * <p>Milliseconds are optional in the seconds field.  The timezone indicator is optional.
>  * </p>
>  *
>  * @return a java.util.Date instance in the UTC timezone, with millisecond precision.
>  */
> Date date() :
> {
>   int CCYY = 0, MM = 0, DD = 0, hh = 0, mm = 0, ss = 0, millis = 0;
>   int deltahh = 0, deltamm = 0;
>   boolean deltaPlus = true;
>   Calendar c = Calendar.getInstance(TimeZone.getTimeZone("UTC"));
>   c.clear(); // start from empty fields; otherwise fields the input omits keep the current time
> }
> {
>   CCYY = year() <DASH>
>   MM = twod() <DASH>
>   DD = twod()
>   {
>     MM--; // months are 0 based
>     c.set(c.YEAR, CCYY);
>     c.set(c.MONTH, MM);
>     c.set(c.DAY_OF_MONTH, DD);
>   }
>   (
>     <T>
>     hh = twod() <COLON>
>     mm = twod() <COLON>
>     ss = twod()
>     {
>       c.set(c.HOUR_OF_DAY, hh);
>       c.set(c.MINUTE, mm);
>       c.set(c.SECOND, ss);
>     }
>     (
>       <DOT>
>       millis = millis()
>       {
>         c.set(c.MILLISECOND, millis);
>       }
>     )?
>     (
>       <Z> // we're already in UTC, so no adjustment needed
>       |
>       (
>         (
>           <PLUS> // somewhere ahead of UTC (east of Greenwich)
>           |
>           <DASH> // behind UTC (west of Greenwich)
>           {
>             deltaPlus = false;
>           }
>         )
>         deltahh = twod() <COLON>
>         deltamm = twod()
>         {
>           if (! deltaPlus) {
>             deltahh = -deltahh;
>             deltamm = -deltamm;
>           }
>           // millisecond offset
>           int offsetFromUTC = ((deltahh * 60) + deltamm) * 60 * 1000;
>           c.set(c.ZONE_OFFSET, offsetFromUTC);
>         }
>       )
>     )?
>   )?
>   {
>     return c.getTime();
>   }
> }
> 
> int millis() :
> {
>   Token t;
> }
> {
>   t = <MILLIS> {
>     return Integer.parseInt(t.image);
>   }
> }
> 
> int twod() :
> {
>   Token t;
> }
> {
>   t = <TWOD> {
>     return Integer.parseInt(t.image);
>   }
> }
> 
> int year() :
> {
>   Token t;
>   boolean positive = true;
> }
> {
>   (
>     <DASH>
>     {
>       positive = false;
>     }
>   )?
>   t = <YEAR> {
>     int year = Integer.parseInt(t.image);
>     if (year == 0) {
>       throw new IllegalArgumentException("0000 is not a legal year");
>     }
>     return positive ? year : -year;
>   }
> }


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by "Eric D. Friedman" <er...@conveysoftware.com>.
Instead of reinventing the wheel for representing dates, how about
using an existing standard?  ISO 8601 defines a simple lexical
representation for dates, times (with optional millisecond precision),
and timezones that is easy to implement.  This is what's used in the
XML Schema "dateTime" datatype.

A summary of the ISO 8601 notation is available here:
http://www.cl.cam.ac.uk/~mgk25/iso-time.html

The documentation for the XML Schema dateTime datatype is here:
http://www.w3.org/TR/xmlschema-2/#dateTime

I whipped up a JavaCC parser to handle this lexical representation (see
attachment).

Note that for this to be useful in QueryParser, it's going to need its
own lexical state.  This makes sense anyway, since it would be a
mistake to have the query syntax infer magical properties about strings
that appear to be dates.  Better is to have a keyword in the query
syntax that introduces a date value:  something like date(<VALUE>)
would work.  So would to_date(<VALUE>) for those who know SQL. I would
have suggested date:<VALUE> but I think that already means something in
the QueryParser's lexical specification. (I don't actually use
QueryParser because the patches I've submitted previously haven't made
it in yet, and until they do, QP is fatally crippled for my purposes).
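
As an illustration of the keyword idea only (this is not the attached JavaCC
grammar, and not QueryParser code), the sketch below recognizes a
date(<VALUE>) term and parses the value with the java.time ISO 8601 parsers.
The class name DateKeyword is hypothetical. Unlike the attached grammar, this
sketch does not accept a timestamp without a timezone indicator.

```java
import java.time.Instant;
import java.time.LocalDate;
import java.time.OffsetDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeParseException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: recognizes date(<VALUE>) and parses VALUE as ISO 8601. */
final class DateKeyword {

    // The keyword, then everything up to the closing parenthesis.
    private static final Pattern DATE_CALL = Pattern.compile("date\\(([^)]*)\\)");

    /**
     * Returns the instant named by date(<VALUE>), or null if the term is not
     * a date(...) term at all. A malformed VALUE raises DateTimeParseException.
     */
    static Instant parse(String term) {
        Matcher m = DATE_CALL.matcher(term);
        if (!m.matches()) {
            return null; // an ordinary term; no date semantics are inferred
        }
        String value = m.group(1);
        try {
            // Full timestamp with "Z" or an explicit offset such as -08:00.
            return OffsetDateTime.parse(value).toInstant();
        } catch (DateTimeParseException e) {
            // Date only (CCYY-MM-DD): taken as midnight UTC in this sketch.
            return LocalDate.parse(value).atStartOfDay(ZoneOffset.UTC).toInstant();
        }
    }
}
```

Because only terms wrapped in date(...) are treated as dates, a bare string
such as 1999-05-31 stays an ordinary term, which is the point of introducing
a keyword.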

Eric

On Sun, 2 Jun 2002, Peter Carlson wrote:

> I like this idea of [GOOP:GOOP] as it gives the most flexibility. However,
> this requires the field to have a known characteristic like a date field,
> number field or text field correct? If you just use the static Field.Date
> this would require adding a new attribute the field class? I like this idea
> but I don't know the difficulty / backward compatibility issues.
>
> If the extra field attribute is too difficult, then I suggest we use the
> nnnn-nn-nn format method so we can use the pattern to determine the data
> type.
>
> For number fields, should this support only integers, or decimal numbers
> too?
>
> I don't think we should use the : character, because we probably want to
> support time formats in the date format. Something like 03/01/2001 at
> 00:01:00. Maybe something like ">" or "|" or even "->" ?
>
> Also, inclusive vs. exclusive should be accounted for with the [ vs {
> characters.  I think this might already be done, but just wanted to throw it
> out there.
>
> --Peter
>
>
> On 6/2/02 2:13 AM, "Brian Goetz" <br...@quiotix.com> wrote:
>
> >>> How about:
> >>>
> >>>  DATE = nnnn-nn-nn
> >>>  NUMBER = n*
> >>>  RANGE = [ DATE : DATE ] | [ NUMBER : NUMBER ]
> >>>
> >>> An alternate, less parse-oriented approach would be this:
> >>>   RANGE = [ GOOP : GOOP ]
> >>> where
> >>>   GOOP = any string of letters/numbers not containing : or ].
> >>
> >> I'd go for the first one as it's more explicit.  However, perhaps the
> >> second approach is more extensible?
> >
> > When I first did the query parser, I defined terms by inclusion
> > (stating valid characters) instead of exclusion (excluding non-term
> > characters.)  Turns out I missed quite a few in the first go around,
> > which taught me the lesson (again) that sometimes trying to be too
> > specific is a rats nest.  What about dates like 02-Mai-2002 (not a
> > typo, french for May)?  Letting DateFormat figure it out has some
> > merit.
> >
> >> DateField(Date) and NumberField(int) sounds right, but wouldn't Field
> >> class make more sense?
> >
> > I had in mind static methods of Field, just like Field.Text --
> > Field.Date, Field.Number.   Sorry if that wasn't clear.  This seems
> > an easy addition.
> >

Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
I like this idea of [GOOP:GOOP] as it gives the most flexibility. However,
this requires the field to have a known characteristic like a date field,
number field or text field, correct? If you just use the static Field.Date,
this would require adding a new attribute to the Field class? I like this idea
but I don't know the difficulty / backward compatibility issues.

If the extra field attribute is too difficult, then I suggest we use the
nnnn-nn-nn format method so we can use the pattern to determine the data
type.

For number fields, should this support only integers, or decimal numbers
too? 

I don't think we should use the : character, because we probably want to
support time formats in the date format. Something like 03/01/2001 at
00:01:00. Maybe something like ">" or "|" or even "->" ?

Also, inclusive vs. exclusive should be accounted for with the [ vs {
characters.  I think this might already be done, but just wanted to throw it
out there.

--Peter


On 6/2/02 2:13 AM, "Brian Goetz" <br...@quiotix.com> wrote:

>>> How about:
>>> 
>>>  DATE = nnnn-nn-nn
>>>  NUMBER = n*
>>>  RANGE = [ DATE : DATE ] | [ NUMBER : NUMBER ]
>>> 
>>> An alternate, less parse-oriented approach would be this:
>>>   RANGE = [ GOOP : GOOP ]
>>> where
>>>   GOOP = any string of letters/numbers not containing : or ].
>> 
>> I'd go for the first one as it's more explicit.  However, perhaps the
>> second approach is more extensible?
> 
> When I first did the query parser, I defined terms by inclusion
> (stating valid characters) instead of exclusion (excluding non-term
> characters.)  Turns out I missed quite a few in the first go around,
> which taught me the lesson (again) that sometimes trying to be too
> specific is a rats nest.  What about dates like 02-Mai-2002 (not a
> typo, french for May)?  Letting DateFormat figure it out has some
> merit.
> 
>> DateField(Date) and NumberField(int) sounds right, but wouldn't Field
>> class make more sense?
> 
> I had in mind static methods of Field, just like Field.Text --
> Field.Date, Field.Number.   Sorry if that wasn't clear.  This seems
> an easy addition.
> 


Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> > How about:
> > 
> >  DATE = nnnn-nn-nn
> >  NUMBER = n*
> >  RANGE = [ DATE : DATE ] | [ NUMBER : NUMBER ]
> > 
> > An alternate, less parse-oriented approach would be this: 
> >   RANGE = [ GOOP : GOOP ]
> > where
> >   GOOP = any string of letters/numbers not containing : or ].  
> 
> I'd go for the first one as it's more explicit.  However, perhaps the
> second approach is more extensible?  

When I first did the query parser, I defined terms by inclusion
(stating valid characters) instead of exclusion (excluding non-term
characters.)  Turns out I missed quite a few in the first go around,
which taught me the lesson (again) that sometimes trying to be too
specific is a rat's nest.  What about dates like 02-Mai-2002 (not a
typo, French for May)?  Letting DateFormat figure it out has some
merit.

> DateField(Date) and NumberField(int) sounds right, but wouldn't Field
> class make more sense?

I had in mind static methods of Field, just like Field.Text -- 
Field.Date, Field.Number.   Sorry if that wasn't clear.  This seems
an easy addition. 



Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Otis Gospodnetic <ot...@yahoo.com>.
Hello,

--- Brian Goetz <br...@quiotix.com> wrote:
> > I think we should have a Date format or formats and then convert them use
> > the DateField to the Lucene date format.
> 
> OK, Common date formats could include:
>   mm/dd/yy 
>   mm/dd/yyyy
>   yyyy-mm-dd (2002-05-01)
>   yyyy-mmm-dd (2002-May-01)
> 
> The latter two seem more sensible as they are non US-centric, but they
> use the dash character which is also used by the range brackets...
> My feeling is we should pick ONE, and stick to it.
> 
> Maybe we should ditch - as the range operator -- if we want to support
> numeric ranges, that makes negative numbers more complicated too.

Sounds like it would eliminate some headaches, yes.

> yyyy-mm-dd seems to be the most common and sensible date format.

I agree.

> How about:
> 
>  DATE = nnnn-nn-nn
>  NUMBER = n*
>  RANGE = [ DATE : DATE ] | [ NUMBER : NUMBER ]
> 
> An alternate, less parse-oriented approach would be this: 
>   RANGE = [ GOOP : GOOP ]
> where
>   GOOP = any string of letters/numbers not containing : or ].  

I'd go for the first one as it's more explicit.  However, perhaps the
second approach is more extensible?  If we come up with anything else
that can use range searches, besides dates and numbers, the latter
approach would not require query parser modification, if I understand
things correctly.

> Then we could use DateFormat to try to convert it into a date.  If
> DateFormat failed, we could try NumberFormat.
> 
> In any case, the analyzer should not be called.  
> 
> 
> Regarding Date and Number fields, I'd like to make the handling of indexed
> date and number fields more automatic.  Rather than calling
>   DateField.dateToString()
> and indexing that, I'd prefer to have appropriate static methods on Document
> like DateField(Date) and NumberField(int).

DateField(Date) and NumberField(int) sounds right, but wouldn't Field
class make more sense?

Otis




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> I think we should have a Date format or formats and then convert them use
> the DateField to the Lucene date format.

OK, common date formats could include:
  mm/dd/yy 
  mm/dd/yyyy
  yyyy-mm-dd (2002-05-01)
  yyyy-mmm-dd (2002-May-01)

The latter two seem more sensible as they are not US-centric, but they
use the dash character, which is also used as the range delimiter inside
the brackets. My feeling is we should pick ONE, and stick to it.

Maybe we should ditch - as the range operator -- if we want to support
numeric ranges, that makes negative numbers more complicated too. 

yyyy-mm-dd seems to be the most common and sensible date format.  How
about:

 DATE = nnnn-nn-nn
 NUMBER = n*
 RANGE = [ DATE : DATE ] | [ NUMBER : NUMBER ]

An alternate, less parse-oriented approach would be this: 
  RANGE = [ GOOP : GOOP ]
where
  GOOP = any string of letters/numbers not containing : or ].  

Then we could use DateFormat to try to convert it into a date.  If
DateFormat failed, we could try NumberFormat.

In any case, the analyzer should not be called.  
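
A rough sketch of the GOOP form described above, using a regular expression
instead of the JavaCC grammar (that substitution, and the names GoopRange and
parse, are assumptions for illustration). Each GOOP is captured as raw text
and never passed through an analyzer.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: matches RANGE = [ GOOP : GOOP ]. */
final class GoopRange {

    // '[' GOOP ':' GOOP ']' where a GOOP contains neither ':' nor ']'.
    // Whitespace around each GOOP is skipped, not captured.
    private static final Pattern RANGE =
        Pattern.compile("\\[\\s*([^:\\]]+?)\\s*:\\s*([^:\\]]+?)\\s*\\]");

    /** Returns the two raw endpoints, or null if the text is not a range. */
    static String[] parse(String text) {
        Matcher m = RANGE.matcher(text);
        return m.matches() ? new String[] { m.group(1), m.group(2) } : null;
    }
}
```

For example, parse("[ 2002-05-01 : 2002-06-01 ]") returns the two dates
untouched. An endpoint that itself contains a colon, as in
"[00:01:00 : 00:02:00]", does not match, which is the cost of using ":" as
the separator.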


Regarding Date and Number fields, I'd like to make the handling of indexed
date and number fields more automatic.  Rather than calling 
  DateField.dateToString() 
and indexing that, I'd prefer to have appropriate static methods on Document
like DateField(Date) and NumberField(int).  




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Otis Gospodnetic <ot...@yahoo.com>.
> Date formats to include might be
> Mm/dd/yyyy where these are all <digit> (this is very US centric but could
> easily be converted for other countries)
> 
> MMM dd, yyyy (where MMM is JAN, FEB, ...)
> 
> Yyyy/mm/dd (be able to know by 4 <digit> start)
> 
> Others????

dd.MM.YYYY.
YYYY-MM-dd

Otis




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
Sounds good.

I think we should have a Date format or formats and then use DateField to
convert them to the Lucene date format.

Date formats to include might be
mm/dd/yyyy where these are all <digit> (this is very US-centric but could
easily be converted for other countries)

MMM dd, yyyy (where MMM is JAN, FEB, ...)

yyyy/mm/dd (recognizable by its leading 4 <digit>s)

Others????
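
One way to support several formats is to try each pattern in turn and
take the first that consumes the whole string. The Java sketch below does
that with SimpleDateFormat. The class name DateGuesser and the exact
pattern list are assumptions for illustration (the list just collects the
formats mentioned in this thread); this is not existing Lucene code.

```java
import java.text.ParsePosition;
import java.text.SimpleDateFormat;
import java.util.Date;
import java.util.Locale;

// Hypothetical helper, not part of Lucene.
final class DateGuesser {

    // Candidate patterns, tried in order; earlier entries win on ambiguity.
    private static final String[] PATTERNS = {
        "MM/dd/yyyy",
        "MMM dd, yyyy",
        "yyyy/MM/dd",
        "dd.MM.yyyy",
        "yyyy-MM-dd"
    };

    /** Returns the parsed date, or null if no pattern matches the whole text. */
    public static Date parse(String text) {
        for (String pattern : PATTERNS) {
            SimpleDateFormat format = new SimpleDateFormat(pattern, Locale.US);
            // Strict mode rejects out-of-range fields such as month 2002.
            format.setLenient(false);
            ParsePosition pos = new ParsePosition(0);
            Date date = format.parse(text, pos);
            if (date != null && pos.getIndex() == text.length()) {
                return date;
            }
        }
        return null;
    }
}
```

A null result means the text matched none of the formats, in which case
the caller can leave the term as is. Note that pattern order decides how
an ambiguous string like 01/02/2002 is read.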


Number would be interesting if we could define a NumberField in Lucene. That
is, potentially pad the number to a maximum length (say, up to 16 digits are
supported). That way, if it's just a set of digits, we could convert it
using the NumberField.
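
To make the padding idea concrete: terms compare as strings, so "9" sorts
after "10" unless both are padded to the same width. Below is a minimal
Java sketch of such a conversion. NumberPadSketch, numberToString, and
the 16-digit width are hypothetical names and values chosen for the
example; no NumberField class exists in Lucene at this point, and
negative numbers are deliberately left out.

```java
// Hypothetical helper, not part of Lucene.
final class NumberPadSketch {

    // Assumed maximum number of digits supported.
    static final int WIDTH = 16;

    /** Left-pads a non-negative number with zeros so string order equals numeric order. */
    public static String numberToString(long n) {
        if (n < 0) {
            throw new IllegalArgumentException("negative numbers need a separate encoding");
        }
        String digits = Long.toString(n);
        if (digits.length() > WIDTH) {
            throw new IllegalArgumentException("more than " + WIDTH + " digits: " + n);
        }
        StringBuilder sb = new StringBuilder(WIDTH);
        for (int i = digits.length(); i < WIDTH; i++) {
            sb.append('0');
        }
        return sb.append(digits).toString();
    }
}
```

With this encoding, both endpoints of a numeric range would be padded the
same way before being handed to the RangeQuery, just as dates go through
DateField.dateToString().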

If they don't match one of the defined formats, then I think we should just
leave them as is. If we tokenize it and it produces multiple tokens then how
would the RangeQuery work?

Thoughts?

--Peter


On 6/1/02 3:54 PM, "Brian Goetz" <br...@quiotix.com> wrote:

> The technical part is generally pretty easy, once we decide what we
> actually want to do.  The problem is when we don't really know
> what we want to accept.
> 
> Let's start with coming up with a rough syntax definition of what
> constitutes an allowable range.  Numbers?  Dates?  Date formats?  




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> This is something I'd like to get working on so if you have any pointers I'd
> spend the time to get the work done.

The technical part is generally pretty easy, once we decide what we
actually want to do.  The problem is when we don't really know
what we want to accept.  

Let's start with coming up with a rough syntax definition of what
constitutes an allowable range.  Numbers?  Dates?  Date formats?  





Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Peter Carlson <ca...@bookandhammer.com>.
That sounds great.

I really know nothing about JavaCC, so I was trying to figure out a way of
not using the StandardTokenizer to tokenize the code.

This is something I'd like to get working on, so if you have any pointers,
I'd be glad to spend the time to get the work done.

Thanks

--Peter

On 6/1/02 3:43 PM, "Brian Goetz" <br...@quiotix.com> wrote:

>> What do people think the right way to handle this issue for the range
>> queries? My suggestion is to do a indexOf() for "-" and create the one or
>> two tokens. That is, don't use the analyzer to determine what the tokens are
>> here. Is there a problem with this?
> 
> We can also use JavaCC's lexical modes to have different sets of rules
> for different tokens.
> 
> The range stuff always felt to me like it was nailed onto the side of
> the query parser.  How about we step back and define a formal syntax
> for acceptable range queries, and then approach that as a parsing
> problem, instead of hacking this hack further?
> 




Re: Bug? QueryParser may not correctly interpret RangeQuery text

Posted by Brian Goetz <br...@quiotix.com>.
> What do people think the right way to handle this issue for the range
> queries? My suggestion is to do a indexOf() for "-" and create the one or
> two tokens. That is, don't use the analyzer to determine what the tokens are
> here. Is there a problem with this?

We can also use JavaCC's lexical modes to have different sets of rules
for different tokens.

The range stuff always felt to me like it was nailed onto the side of
the query parser.  How about we step back and define a formal syntax
for acceptable range queries, and then approach that as a parsing
problem, instead of hacking this hack further?
