Posted to solr-user@lucene.apache.org by Brian Whitman <br...@variogr.am> on 2007/05/10 18:59:59 UTC

dates & times

After writing my 3rd parser in my third scripting language in as many
months to go from unix timestamps to "Solr Time" (ISO 8601) I have to
ask: shouldn't the date/time field type be more resilient? I assume
there's a good reason that it's 8601 internally, but certainly it
would be excellent for Solr to transcode different types into Solr Time.

My main problem (as a normal Solr end user) is that it's hard to do  
math directly on 8601 dates or really parse them without specific  
packages. My XSL 2.0 parsers don't even like it without some  
massaging (forget about XSL 1.0.) UNIX time (seconds since the epoch)  
is super easy, as are sortable delimitable strings like  
"20070510125403."
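
For reference, the client-side conversion being described can be sketched in a few lines of Python. This is only an illustration, not anything Solr ships: the function names are made up, and it assumes the compact "20070510125403" style string is already in UTC.

```python
from datetime import datetime, timezone

# Solr's external date format: ISO 8601, always UTC, with a trailing "Z".
SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def unix_to_solr(seconds: int) -> str:
    """Convert seconds since the epoch to a Solr Time string."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).strftime(SOLR_FORMAT)


def compact_to_solr(compact: str) -> str:
    """Convert a sortable YYYYMMDDHHMMSS string (assumed UTC) to Solr Time."""
    return datetime.strptime(compact, "%Y%m%d%H%M%S").strftime(SOLR_FORMAT)


def solr_to_unix(solr_time: str) -> int:
    """Convert a Solr Time string back to seconds since the epoch."""
    parsed = datetime.strptime(solr_time, SOLR_FORMAT).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())


print(unix_to_solr(1178823599))
print(compact_to_solr("20070510125403"))
```

Going through a timezone-aware datetime keeps the result in UTC regardless of the machine's local zone, which matters because Solr Time has no way to express an offset.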

I'm not advocating replacing 8601 as the known good Solr Time, just
that some leeway be given in the parser to accept unix time or
something else and that the conversion to 8601 happen internally. And a
further dream is to have a strftime formatter in solrconfig for the
query response, so I can always have my date fields come back as "May
10th, 2007, 12:58pm."
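
Until something like that exists, the display formatting has to happen on the client. A rough Python sketch of producing the "May 10th, 2007, 12:58pm" style from a Solr Time string follows; the helper names are hypothetical, and the month names are spelled out by hand so the output does not depend on the locale.

```python
from datetime import datetime

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def ordinal(day: int) -> str:
    """Return the day with its English ordinal suffix (1st, 2nd, 3rd, 4th...)."""
    if 11 <= day % 100 <= 13:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(day % 10, "th")
    return f"{day}{suffix}"


def friendly(solr_time: str) -> str:
    """Format a Solr Time string for display, e.g. 'May 10th, 2007, 12:58pm'."""
    moment = datetime.strptime(solr_time, "%Y-%m-%dT%H:%M:%SZ")
    hour12 = moment.hour % 12 or 12  # 0 and 12 both display as 12
    meridiem = "am" if moment.hour < 12 else "pm"
    return (f"{MONTHS[moment.month - 1]} {ordinal(moment.day)}, {moment.year}, "
            f"{hour12}:{moment.minute:02d}{meridiem}")


print(friendly("2007-05-10T12:58:00Z"))
```

Note that plain strftime has no directive for the "th" suffix, which is why the ordinal is built separately here.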

-Brian








Re: dates & times

Posted by Ryan McKinley <ry...@gmail.com>.
> 
> (In general a "DateTranslatingTokenFilter" class would be a pretty cool
> addition to Lucene, it could take as constructor args two DateFormatters
> (one for parsing the incoming tokens, and one for formatting the outgoing

If this happens, it would be nice (perhaps overkill) to have a "chronic" 
input filter:

http://chronic.rubyforge.org/

the java port:
https://jchronic.dev.java.net/

---

Brian, for a quick and easy solution: if you find unix timestamps
easier to work with, perhaps you just want to put dates in as a
SortableLongField and deal with the formatting that way.

Re: dates & times

Posted by Chris Hostetter <ho...@fucit.org>.
: The right approach for more flexible date parsing is probably to add
: more functionality to the date field and configure via optional
: attributes.

Adding configuration options to DateField seems like it might ultimately
be the right choice for changing the *internal* format, but assuming we
want to keep the internal representation of DateField fixed and
unconfigurable for the time being and address the various *external*
formatting issues, i imagine the simplest ways to tackle this (in a way
that is consistent with the other datatypes) would be...

1) change DateField to support Analyzers.  that way you could have
separate analyzers for indexing vs querying just like a text field (so you
could for example send Solr seconds since epoch when indexing, and
query using MM/DD/YYYY)

The Analyzers used would be responsible for producing Tokens which match
the values the current DateField.toInternal() already considers legal
(either a DateMath string or an ISO 8601 string).

(In general a "DateTranslatingTokenFilter" class would be a pretty cool
addition to Lucene, it could take as constructor args two DateFormatters
(one for parsing the incoming tokens, and one for formatting the outgoing
tokens) and a boolean indicating whether its job was to replace matching
tokens or inject duplicate tokens in the same position ... maybe another
option indicating whether incoming Tokens that can't be parsed should be
stripped or passed through ... the idea being that for something like
DateField you would use KeywordTokenizer along with an instance of this to
parse whatever format you wanted -- but when parsing generic text you
might have several of these TokenFilters configured with different
DateFormatters so if they see a Token in the text that matches a known
DateFormat they could inject the name of the month, or the day of the week
into the text at the same position.)


2) add options to the various QueryResponseWriters to control which format
they use when writing fields out ... in the case of XMLResponseWriter it
would still produce a <date> tag, but the value would be formatted
according to the configuration.
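
To make the "DateTranslatingTokenFilter" idea from point 1 a bit more concrete, here is a rough sketch of its behaviour. No such class exists in Lucene; this is a Python stand-in rather than the Lucene TokenFilter API, tokens are modelled as simple (position, text) pairs, and the parameter names are invented for illustration.

```python
from datetime import datetime
from typing import Iterable, Iterator, Tuple

Token = Tuple[int, str]  # (position, text); a simplified stand-in for a Lucene Token


def date_translating_filter(tokens: Iterable[Token],
                            parse_format: str,
                            output_format: str,
                            replace: bool = True,
                            drop_unparsed: bool = False) -> Iterator[Token]:
    """Re-emit tokens, translating any that parse as dates.

    replace=True swaps a matching token for its translated form;
    replace=False keeps the original and injects the translation at the
    same position. drop_unparsed controls whether tokens that do not
    parse are stripped or passed through unchanged.
    """
    for position, text in tokens:
        try:
            parsed = datetime.strptime(text, parse_format)
        except ValueError:
            if not drop_unparsed:
                yield position, text
            continue
        if not replace:
            yield position, text
        yield position, parsed.strftime(output_format)


# DateField-style use: one keyword token in, one ISO 8601 token out.
print(list(date_translating_filter([(0, "05/10/2007")],
                                   "%m/%d/%Y", "%Y-%m-%dT%H:%M:%SZ")))

# Generic text use: inject the month name alongside the original token.
print(list(date_translating_filter([(0, "posted"), (1, "05/10/2007")],
                                   "%m/%d/%Y", "%B", replace=False)))
```

The month name produced by "%B" depends on the process locale; the sketch assumes the default C/English locale.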


-Hoss


Re: dates & times

Posted by Yonik Seeley <yo...@apache.org>.
On 5/10/07, Brian Whitman <br...@variogr.am> wrote:
> On May 10, 2007, at 2:30 PM, Chris Hostetter wrote:
> > Questions like these are why I'm glad Solr currently keeps it
> > simple and
> > makes people deal in absolutes .. less room for confusion  :)
>
> I get all that, thanks for the great explanation.
>
> I imagine most of my problems can be solved with a custom field
> analyzer (converting other date strings to 8601 during indexing) and
> then XSL on the select?q= end (which we do anyway.) In other words,
> leaving core solr absolute with an option for different date
> analyzers. I see the need to not clutter it up, especially at this
> stage.
>
> What would, say, a filter that converted unix timestamps to 8601
> before indexing as a solr.DateField look like? Is that a custom
> filter, or a tokenizer?

That would be a custom filter ... which is currently only supported by
text fields, so the XML output would be <str> instead of <date> (if
that matters to you).

One could also just store seconds or milliseconds in an int or long
field.  That's fine for small devel teams, but not ideal since it's a
bit less expressive.

The right approach for more flexible date parsing is probably to add
more functionality to the date field and configure via optional
attributes.

-Yonik

Re: dates & times

Posted by Brian Whitman <br...@variogr.am>.
On May 10, 2007, at 2:30 PM, Chris Hostetter wrote:
> Questions like these are why I'm glad Solr currently keeps it
> simple and
> makes people deal in absolutes .. less room for confusion  :)

I get all that, thanks for the great explanation.

I imagine most of my problems can be solved with a custom field  
analyzer (converting other date strings to 8601 during indexing) and  
then XSL on the select?q= end (which we do anyway.) In other words,  
leaving core solr absolute with an option for different date  
analyzers. I see the need to not clutter it up, especially at this  
stage.

What would, say, a filter that converted unix timestamps to 8601  
before indexing as a solr.DateField look like? Is that a custom  
filter, or a tokenizer?





RE: dates & times

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
Regarding Hoss's points about the internal format, resolution of
date-times, etc.: maybe a good starting point would be to implement the
date-time algorithms of XML Schema
(http://www.w3.org/TR/xmlschema-2/#isoformats), where these behaviors
are spelled out in reasonably precise terms. There must be code
somewhere that Solr could steal to help with this. This would mesh well
with XSLT 2.0, and presumably other modern XML environments.

peter


Re: dates & times

Posted by Chris Hostetter <ho...@fucit.org>.
: It's more than string processing, anyway. I would want to convert the
: Solr Time 2007-03-15T00:41:5:2Z to "March 15th, 2007" in a web app.
: I'd also like to say "Posted 3 days ago." In my vision of things,
: that work is done on Solr's side. (The former case with a strftime
: type formatter in solrconfig, the latter by having strftime return
: the day number this year.)

One of the early architecture/design principles of the Solr "search"
APIs was "compute secondary info about a result if it's more
efficient or easier to compute in Solr than it would be for a client to do
it" -- DocSet caches, facet counts, and sorting/pagination being
great examples of things where Solr can do less "work" to get the same
info out of raw data than a client app would because of its low-level
access to the data, and because of how much data would need to go over the
wire for the client to do the same computation. ... that's largely just a
little bit of historical trivia however, Solr has a lot of features now
which might not hold up to that yardstick, but i mention it only to clarify
one of the reasons Solr didn't have more "configurable" date formatting to
start with.

it has been on the TaskList since the start of incubation however...

  * a DateTime field (or Query Parser extension) that allows flexible
    input for easier human entered queries
  * allow alternate format for date output to ease client creation of
    date objects?

One of the reasons i don't think anyone has tackled them yet is because
it's hard to get a holistic view of a solution, because there are really
several loosely related problems with date formatting issues:

The first is a discussion of the "internal format" and what resolution the
dates are stored at in the index itself.  if you *know* that you never
plan on querying with anything more fine-grained than day resolution,
storing your dates with only day resolution can make your index a lot
smaller (and make date searches a lot faster).  with the current DateField
the same performance benefits can be achieved by "rounding" your dates
before indexing them, but if we were to make it a config option on
DateField itself to automatically round, we would need to take this info
into account when parsing updates -- should the client be expected to know
what precision each date field uses?  do they send dates expressed using
the "internal" format, or as fully qualified times?  is it an
error/warning to attempt to index more datetime precision than a field
supports?
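
As an illustration of the client-side "rounding" mentioned above, here is a small Python sketch that truncates Solr Time strings to day resolution before they are sent for indexing. The function name is made up for the example, and it assumes truncation (rounding down) is the rounding you want.

```python
from datetime import datetime

SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def round_to_day(solr_time: str) -> str:
    """Truncate a Solr Time string to midnight UTC of the same day."""
    moment = datetime.strptime(solr_time, SOLR_FORMAT)
    return moment.replace(hour=0, minute=0, second=0).strftime(SOLR_FORMAT)


posted = ["2007-05-10T08:15:00Z", "2007-05-10T12:54:03Z", "2007-05-10T18:59:59Z"]
rounded = {round_to_day(stamp) for stamp in posted}

# Three distinct values collapse into one, so the index holds fewer
# unique terms for the field.
print(rounded)
```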

The second is a discussion of "external format" (which seems to be what
you are mostly discussing)  the most trivial way to address this would be
options on the ResponseWriters that allow them to be configured with
DateFormatter Strings they would use to process any date they return .. but
that raises questions about the QueryParsing aspect as well ... should
date formatting be a property of the response, or a property of the
request, such that both input and output formats are identical?

Third is how the discussions of the internal format and the external
format shouldn't be treated as completely independent.  it's tempting to
say that there will be a clean abstraction between the two, that all client
interaction will be done using configured "external" formatter(s) to create
internal java Date objects, which will then be translated back to Strings
by an "internal" formatter for the purpose of indexing (and querying) but
what happens when a query expresses a date range too precise for the
granularity expressed by the internal format? do we match
nothing/everything? ... what if the indexed granularity is *more* precise
than the query granularity .. how do we know if a range query between March
6, 2007 and May 10, 2007 on a field that stores millisecond granularity
is supposed to go from the first millisecond of each day or the last?
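
The ambiguity in that last question is easy to see once the day-level dates are expanded to milliseconds. The Python sketch below builds both readings as Solr range queries; the field name "posted" and the helper names are hypothetical, and neither reading is claimed to be the right one.

```python
from datetime import date, datetime, time, timedelta
from typing import Tuple


def to_solr_millis(moment: datetime) -> str:
    """Format a datetime as Solr Time with millisecond precision."""
    return moment.strftime("%Y-%m-%dT%H:%M:%S.") + f"{moment.microsecond // 1000:03d}Z"


def day_bounds(day: date) -> Tuple[str, str]:
    """Return the first and last millisecond of a day as Solr Time strings."""
    start = datetime.combine(day, time.min)
    end = start + timedelta(days=1) - timedelta(milliseconds=1)
    return to_solr_millis(start), to_solr_millis(end)


def range_query(field: str, first: date, last: date, through_end_of_day: bool) -> str:
    """Build a range query; the flag picks which millisecond of the last day to use."""
    lower = day_bounds(first)[0]
    upper = day_bounds(last)[1 if through_end_of_day else 0]
    return f"{field}:[{lower} TO {upper}]"


# Same two dates, two different sets of matching documents.
print(range_query("posted", date(2007, 3, 6), date(2007, 5, 10), True))
print(range_query("posted", date(2007, 3, 6), date(2007, 5, 10), False))
```

A document indexed at noon on May 10 matches the first query but not the second, which is exactly the kind of decision a configurable external format would have to make on the user's behalf.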



Questions like these are why I'm glad Solr currently keeps it simple and
makes people deal in absolutes .. less room for confusion  :)


-Hoss


RE: dates & times

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
Minor clarification re the exslt license: that applies to the external
exslt implementations, which you only need if your xsl engine doesn't
support exslt natively. Since Xalan does, at least mostly, it's all
already there in Solr.

I agree that more flexible date-time parsing of input to Solr is also
desirable.

Peter





Re: dates & times

Posted by Brian Whitman <br...@variogr.am>.
> You can get at some of this functionality in the built-in xslt 1.0
> engine (Xalan) by using the e-xslt date-time extensions: see
> http://exslt.org/date/index.html, and for Xalan's implementation see
> http://xml.apache.org/xalan-j/extensionslib.html#exslt .

The exslt stuff looks good, thanks! I'll have to try it out. That's
only one direction though; I still want parsing of unix timestamp-like
formats into the indexer on doc adds and updates.

Just FYI the license for the exslt stuff seems OK w/ the APL:
http://lists.fourthought.com/pipermail/exslt-manage/2004-June/000603.html
So if it works out we might want to put the date/time xsl stuff in
the solr distribution in lieu of shipping with an XSL 2.0 processor.






--
http://variogr.am/
brian.whitman@variogr.am




RE: dates & times

Posted by "Binkley, Peter" <Pe...@ualberta.ca>.
You can get at some of this functionality in the built-in xslt 1.0
engine (Xalan) by using the e-xslt date-time extensions: see
http://exslt.org/date/index.html, and for Xalan's implementation see
http://xml.apache.org/xalan-j/extensionslib.html#exslt . There are some
examples here:
http://www-128.ibm.com/developerworks/library/x-exslt.html . I haven't
tried this in Solr but I don't think there's any reason why it wouldn't
work; I've used it in other Xalan-J environments, notably Cocoon. 

Peter







Re: dates & times

Posted by Brian Whitman <br...@variogr.am>.
>
> Those are interesting ideas and it probably would not be difficult to
> create a patch if you were interested, but I'm curious:  What about
> XSL makes what seems to me an elementary string-processing task so
> difficult?
>

Well, XSL 1.0 (which is the one that "comes for free" with Solr/java)  
doesn't handle dates and times at all. XSL 2.0 handles it well  
enough, but it's only supported through a GPL jar, which we can't  
distribute.

It's more than string processing, anyway. I would want to convert the
Solr Time 2007-03-15T00:41:5:2Z to "March 15th, 2007" in a web app.
I'd also like to say "Posted 3 days ago." In my vision of things,
that work is done on Solr's side. (The former case with a strftime
type formatter in solrconfig, the latter by having strftime return
the day number this year.)
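
For comparison, this is roughly what the client has to do today for the "Posted 3 days ago" case, sketched in Python with made-up helper names. The day-of-year number comes from strftime's "%j" directive; the relative label is computed by subtracting calendar dates instead, since subtracting day numbers goes wrong across a year boundary.

```python
from datetime import datetime

SOLR_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def day_of_year(solr_time: str) -> int:
    """Return the day number within the year (1-366) for a Solr Time string."""
    return int(datetime.strptime(solr_time, SOLR_FORMAT).strftime("%j"))


def posted_ago(solr_time: str, now: datetime) -> str:
    """Return a 'Posted N days ago' label; 'now' is a naive UTC datetime."""
    then = datetime.strptime(solr_time, SOLR_FORMAT)
    days = (now.date() - then.date()).days
    if days == 0:
        return "Posted today"
    if days == 1:
        return "Posted 1 day ago"
    return f"Posted {days} days ago"


print(day_of_year("2007-05-10T12:58:00Z"))
print(posted_ago("2007-05-07T12:00:00Z", datetime(2007, 5, 10, 12, 58)))
```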






Re: dates & times

Posted by Mike Klaas <mi...@gmail.com>.
On 5/10/07, Brian Whitman <br...@variogr.am> wrote:
> After writing my 3rd parser in my third scripting language in as many
> months to go from unix timestamps to "Solr Time" (ISO 8601) I have to
> ask: shouldn't the date/time field type be more resilient? I assume
> there's a good reason that it's 8601 internally, but certainly it
> would be excellent for Solr to transcode different types into Solr Time.
>
> My main problem (as a normal Solr end user) is that it's hard to do
> math directly on 8601 dates or really parse them without specific
> packages. My XSL 2.0 parsers don't even like it without some
> massaging (forget about XSL 1.0.) UNIX time (seconds since the epoch)
> is super easy, as are sortable delimitable strings like
> "20070510125403."

I'm not sure what delimitable means, but Solr datetimes _are_
essentially sortable inverse-magnitude like the above, with a few
punctuation symbols thrown in.  I have no XSLT-fu, but is it not
possible to do regexp-replace s/[TZ:-]//g on the solrdate to get the
above?
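
Outside XSLT, that substitution is a one-liner. A quick Python sketch (the function name is just for illustration) showing the same character class, and that the stripped strings sort in the same order as the original Solr dates:

```python
import re


def solr_to_compact(solr_time: str) -> str:
    """Strip the T, Z, ':' and '-' punctuation from a Solr Time string."""
    return re.sub(r"[TZ:-]", "", solr_time)


stamps = ["2007-05-10T12:54:03Z", "2006-12-31T23:59:59Z", "2007-05-10T08:15:00Z"]

print(solr_to_compact("2007-05-10T12:54:03Z"))

# Both forms are fixed-width and most-significant-first, so plain string
# sorting gives chronological order either way.
print([solr_to_compact(stamp) for stamp in sorted(stamps)])
```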

> I'm not advocating replacing 8601 as the known good Solr Time, just
> that some leeway be given in the parser to accept unix time or
> something else and the conversion to 8601 happens internally. And a
> further dream is to have a strftime formatter in solrconfig for the
> query response, so I can always have my date fields come back as "May
> 10th, 2007, 12:58pm."

Those are interesting ideas and it probably would not be difficult to
create a patch if you were interested, but I'm curious:  What about
XSL makes what seems to me an elementary string-processing task so
difficult?

regards
-Mike