Posted to dev@lucene.apache.org by David Smiley <da...@gmail.com> on 2016/03/22 03:32:40 UTC

Solr date formatting/parsing; upgrade to java.time

I recently discovered that Solr doesn't support negative years (i.e. BC)
and I set about working on fixing it.  It's just a formatting issue (date
-> string), from what I see.  I could just fix this but then I saw that we
haven't yet moved on to the Java 8 java.time API (a derivative of Joda
time), and I'd rather simplify Solr's date formatting & parsing code using
this API instead of adding just one fix for something java.time already
supports.  I also see we annoyingly have 2 date utility classes:
DateFormatUtil and DateUtil.

I started a quick hack to cut over DateFormatUtil's formatting to this
one-liner:  DateTimeFormatter.ISO_INSTANT.format(d.toInstant());    and
similarly there's a one-liner for parsing.

I'd love to just cut over to this but there are some slight differences we
would see and I want to get people's opinion if any of these differences
are a blocker:
* Milliseconds are 0 padded to 3 digits if the milliseconds is non-zero.
Thus 30 milliseconds will have ".030" added on.  Our current formatting
code does ".03".
* Dates with years after '9999' (i.e. 10000 and beyond, >= 5 digit years)
*must* have a leading '+' -- it is formatted with a "+" and if such a year
is parsed it *must* have a "+" or there is an exception.  Currently a '+'
would yield an exception and there is no "+" emitted.
* Of course as mentioned, currently we don't support negative years
(resulting in invisible errors mostly); we'd get this for free.
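
To make the differences listed above concrete, here is a minimal sketch
that uses only java.time and java.util.Date.  The class and method names
(IsoInstantSketch, format, parse, parses) are made up for illustration and
are not Solr's DateFormatUtil API.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Date;

// Illustrative only: hypothetical helper names, not Solr code.
final class IsoInstantSketch {

    // The formatting one-liner from this message.
    static String format(Date d) {
        return DateTimeFormatter.ISO_INSTANT.format(d.toInstant());
    }

    // The matching parsing one-liner; Instant.parse uses ISO_INSTANT.
    static Date parse(String s) {
        return Date.from(Instant.parse(s));
    }

    // True if the strict ISO_INSTANT parser accepts the string.
    static boolean parses(String s) {
        try {
            Instant.parse(s);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Non-zero milliseconds are padded to 3 digits.
        System.out.println(format(new Date(30)));  // 1970-01-01T00:00:00.030Z
        // Zero milliseconds produce no fractional part at all.
        System.out.println(format(new Date(0)));   // 1970-01-01T00:00:00Z
        // Years beyond 9999 are emitted with a leading '+'.
        System.out.println(format(parse("+10000-01-01T00:00:00Z")));
        // The same year without '+' is rejected when parsing.
        System.out.println(parses("10000-01-01T00:00:00Z"));  // false
        // Negative years round-trip.
        System.out.println(format(parse("-0001-01-01T00:00:00Z")));
    }
}
```

Running main prints one line per bullet above, which may help anyone
checking client-side parsers against the new output.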

This stuff should matter so little to most apps that I hope there isn't
dissent on my proposal.  If there's a real reason to keep something
consistent with current behavior, I'm sure we could complicate things
further but it'd be great to simply use ISO_INSTANT exactly.

See https://issues.apache.org/jira/browse/SOLR-1899 and sub-task SOLR-2773
(the 2nd one) specifically.
-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com

Re: Solr date formatting/parsing; upgrade to java.time

Posted by David Smiley <da...@gmail.com>.

Thanks for the historical context; it's nice to know.  You say it had to do
with XML but XML's choice was in turn following ISO specs.  So even though
the user might not be using XML, the date format we've chosen is sound
regardless of the response format.

JavaBin uses the 1970 epoch in milliseconds, and I suspect some other
formats like Avro might too though I don't know.  I checked Avro and it's
also milliseconds.  So no formatting issues there.  I suspect other
efficient protocols do likewise, to avoid formatting concerns and because,
of course, it's efficient to parse.

Regarding back-compat, yeah I get that there will be some users that need
to tweak something.  But this is a major release we're talking about.  With
major releases especially, users take their time to upgrade and don't do so
blindly.  IMO it's not worth making work for ourselves that will be
inevitable for a user anyway.  Considering your input, perhaps on the
parsing side (but not formatting) I could go with a system property (not
request param) that enables Solr to use a DateTimeFormatter instance
constructed with DateTimeFormatterBuilder.parseLenient().  It doesn't
parse 65 seconds (which I found an oddball example of in a test), but
parses non-padded time elements (e.g. "1" for January) and allows the "+"
to be optional for 5 digit years.  I just did some experimentation to
observe this.
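
The experiment described above might look roughly like the following.
This is a sketch under assumptions: the builder chain (parseLenient()
followed by appendInstant()) is one plausible way to construct such a
formatter, and the system property name solr.lenientDateParsing is
hypothetical, as are the class and method names.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeFormatterBuilder;
import java.time.format.DateTimeParseException;
import java.util.Locale;

// Illustrative only: hypothetical names, not the actual patch.
final class LenientInstantParserSketch {

    // Hypothetical system property name; the thread does not name one.
    static final String LENIENT_PROPERTY = "solr.lenientDateParsing";

    // Same instant syntax as ISO_INSTANT, but parsed leniently.
    static final DateTimeFormatter LENIENT = new DateTimeFormatterBuilder()
            .parseLenient()
            .appendInstant()
            .toFormatter(Locale.ROOT);

    // Parses with the lenient formatter if requested, else strictly.
    static Instant parse(String text, boolean lenient) {
        return lenient ? LENIENT.parse(text, Instant::from) : Instant.parse(text);
    }

    // True if the chosen parser accepts the string.
    static boolean accepts(String text, boolean lenient) {
        try {
            parse(text, lenient);
            return true;
        } catch (DateTimeParseException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        boolean lenient = Boolean.getBoolean(LENIENT_PROPERTY);
        // Non-padded month ("1" for January).
        System.out.println(accepts("2016-1-05T03:32:40Z", lenient));
        // Five-digit year without the leading '+'.
        System.out.println(accepts("12345-12-31T23:59:59Z", lenient));
        // 65 seconds, reported in this message as still rejected.
        System.out.println(accepts("2016-03-22T03:32:65Z", lenient));
    }
}
```

Run with -Dsolr.lenientDateParsing=true to compare against the strict
default.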

I said parsing but not formatting because there are far fewer call sites in
my patch for Instant.parse and because of the "Robustness Principle":
https://en.wikipedia.org/wiki/Robustness_principle

Sorry, but I don't like the idea of supporting a list of formats.  Sounds
great for the extraction module or some sort of explicitly invoked utility
(e.g. in an URP), but not for everywhere.

~ David

On Fri, Mar 25, 2016 at 2:37 PM Chris Hostetter <ho...@fucit.org>
wrote:

>
> : I started a quick hack to cut over DateFormatUtil's formatting to this
> : one-liner:  DateTimeFormatter.ISO_INSTANT.format(d.toInstant());    and
>         ...
> : I'd love to just cut over to this but there are some slight differences we
> : would see and I want to get people's opinion if any of these differences
> : are a blocker:
>
> For some historical context: the current format was chosen back in the
> days when the *only* way to get data in or out of Solr was XML, and the
> format was picked to be consistent with the best standard available at the
> time for representing moments in time in XML.
>
> When we use javabin for communicating with clients, we use serialized
> versions of the Date objects that deserialize back to Date objects on the
> other side of the wire; and likewise if we added thrift, or avro, or
> whatever support for communicating with clients, i would be 100% in favor
> of following whatever standards those formats have for communicating
> moment in time info -- but even in all of those cases where there may be a
> protocol specific representation for moments in time over the wire, it
> still seems very important/useful to have a standard (default) format
> for dealing with dates represented as *strings* over the wire --
> particularly when dealing with query parsing.
>
>
> In general, i'm a *little* concerned about how tweaking the
> parsing/formatting of "dates as strings" will affect existing
> clients/users that are using generic/custom parsing/formatting code in
> various languages (however small those tweaks may be)...
>
> : * Milliseconds are 0 padded to 3 digits if the milliseconds is non-zero.
> : Thus 30 milliseconds will have ".030" added on.  Our current formatting
> : code does ".03".
>
> ...probably won't impact many existing users since they already
> have to be prepared to parse up to 3 decimals, and (i'm assuming) the new
> parser you're suggesting we use in solr will be forgiving if they send a
> string with less.  but if someone was obeying the existing spec
> religiously they may have errors if we return "*.900"
>
> : * Dates with years after '9999' (i.e. 10000 and beyond, >= 5 digit years)
> : *must* have a leading '+' -- it is formatted with a "+" and if such a year
> : is parsed it *must* have a "+" or there is an exception.  Currently a '+'
> : would yield an exception and there is no "+" emitted.
>
> ...this could also easily be problematic for some people in practice.
>
> : * Of course as mentioned, currently we don't support negative years
> : (resulting in invisible errors mostly); we'd get this for free.
>
> +1
>
>
> : dissent on my proposal.  If there's a real reason to keep something
> : consistent with current behavior, I'm sure we could complicate things
> : further but it'd be great to simply use ISO_INSTANT exactly.
>
> My personal preference would be to at least have some ability to enable a
> backcompat mode...
>
> 1) switch to the new syntax as you describe by default
> 2) support a request param option to force the old parsing code to be used
> (or perhaps just *emulated* by stripping off any leading + and trailing
> 0 when formatting, and adding them if needed when parsing) ... this would
> be a fall back option for clients whose existing code is brittle, and
> wouldn't have to be something we optimize heavily -- people would be
> encouraged to switch to the new format ASAP.
> 3) after backporting to 6x, remove support for the request param override
> in master (so it's not supported at all starting in 7x)
>
> ...i think as a baseline, that would be awesome -- but I wonder if there
> are bigger questions we should be asking, and a better overall "API"
> for how date formatting/parsing rules are chosen on a per-request basis?
>
> ie: If we're going to change the rules of date parsing/formatting,
> should we rethink *all* the rules, rather than just tweaking one detail of
> rules that were written solely with communicating over XML in mind?
>
>
> I mean ... i'm way out of the loop on the state of art in java.time.*, but
> even w/o knowing what all is supported/recommended there, i have to wonder
> if this would be a good opportunity to add general support for a
> 'datetime.fmt' request param that can be multivalued to specify an ordered
> set of parsers to be used any time solr needs to parse a String specified
> by the client as a moment in time -- and using the first fmt specified any
> time a moment in time may need to be formatted to return to the user as a string
>
>         ?
>
>
>
> -Hoss
> http://www.lucidworks.com/
>

Re: Solr date formatting/parsing; upgrade to java.time

Posted by Chris Hostetter <ho...@fucit.org>.

: I started a quick hack to cut over DateFormatUtil's formatting to this
: one-liner:  DateTimeFormatter.ISO_INSTANT.format(d.toInstant());    and
	...
: I'd love to just cut over to this but there are some slight differences we
: would see and I want to get people's opinion if any of these differences
: are a blocker:

For some historical context: the current format was chosen back in the
days when the *only* way to get data in or out of Solr was XML, and the 
format was picked to be consistent with the best standard available at the 
time for representing moments in time in XML.

When we use javabin for communicating with clients, we use serialized
versions of the Date objects that deserialize back to Date objects on the 
other side of the wire; and likewise if we added thrift, or avro, or 
whatever support for communicating with clients, i would be 100% in favor 
of following whatever standards those formats have for communicating 
moment in time info -- but even in all of those cases where there may be a 
protocol specific representation for moments in time over the wire, it 
still seems very important/useful to have a standard (default) format
for dealing with dates represented as *strings* over the wire -- 
particularly when dealing with query parsing.


In general, i'm a *little* concerned about how tweaking the 
parsing/formatting of "dates as strings" will affect existing 
clients/users that are using generic/custom parsing/formatting code in 
various languages (however small those tweaks may be)...

: * Milliseconds are 0 padded to 3 digits if the milliseconds is non-zero.
: Thus 30 milliseconds will have ".030" added on.  Our current formatting
: code does ".03".

...probably won't impact many existing users since they already
have to be prepared to parse up to 3 decimals, and (i'm assuming) the new 
parser you're suggesting we use in solr will be forgiving if they send a 
string with less.  but if someone was obeying the existing spec 
religiously they may have errors if we return "*.900"

: * Dates with years after '9999' (i.e. 10000 and beyond, >= 5 digit years)
: *must* have a leading '+' -- it is formatted with a "+" and if such a year
: is parsed it *must* have a "+" or there is an exception.  Currently a '+'
: would yield an exception and there is no "+" emitted.

...this could also easily be problematic for some people in practice.

: * Of course as mentioned, currently we don't support negative years
: (resulting in invisible errors mostly); we'd get this for free.

+1


: dissent on my proposal.  If there's a real reason to keep something
: consistent with current behavior, I'm sure we could complicate things
: further but it'd be great to simply use ISO_INSTANT exactly.

My personal preference would be to at least have some ability to enable a 
backcompat mode...

1) switch to the new syntax as you describe by default
2) support a request param option to force the old parsing code to be used 
(or perhaps just *emulated* by stripping off any leading + and trailing 
0 when formatting, and adding them if needed when parsing) ... this would 
be a fall back option for clients whose existing code is brittle, and
wouldn't have to be something we optimize heavily -- people would be
encouraged to switch to the new format ASAP.
3) after backporting to 6x, remove support for the request param override 
in master (so it's not supported at all starting in 7x)
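
As a rough illustration of the *emulated* variant in option 2, the sketch
below post-processes ISO_INSTANT output and pre-processes input.  It is
not the old Solr code: the class and method names are hypothetical, and it
assumes the only differences to bridge are the two named above (the
leading '+' and the trailing zeros in the fraction).

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

// Illustrative only: hypothetical names, not Solr's legacy formatter.
final class LegacyDateEmulationSketch {

    // Formats with ISO_INSTANT, then strips a leading '+' and any
    // trailing zeros of the fractional seconds (".030" becomes ".03").
    static String legacyFormat(Instant instant) {
        String s = DateTimeFormatter.ISO_INSTANT.format(instant);
        if (s.startsWith("+")) {
            s = s.substring(1);
        }
        int dot = s.indexOf('.');
        if (dot >= 0) {
            int end = s.length() - 1;  // index of the trailing 'Z'
            while (end - 1 > dot && s.charAt(end - 1) == '0') {
                end--;
            }
            s = s.substring(0, end) + "Z";
        }
        return s;
    }

    // Adds the '+' that ISO_INSTANT requires for unsigned years of five
    // or more digits.  Short fractions such as ".03" already parse.
    static Instant legacyParse(String s) {
        if (!s.isEmpty()) {
            char first = s.charAt(0);
            int firstDash = s.indexOf('-', 1);
            if (first != '+' && first != '-' && firstDash > 4) {
                s = "+" + s;
            }
        }
        return Instant.parse(s);
    }

    public static void main(String[] args) {
        System.out.println(legacyFormat(Instant.ofEpochMilli(30)));   // 1970-01-01T00:00:00.03Z
        System.out.println(legacyFormat(Instant.ofEpochMilli(900)));  // 1970-01-01T00:00:00.9Z
        System.out.println(legacyParse("10000-01-01T00:00:00Z"));     // +10000-01-01T00:00:00Z
    }
}
```

Since this is plain string manipulation on top of the new code path, it
would not need to be optimized heavily.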

...i think as a baseline, that would be awesome -- but I wonder if there 
are bigger questions we should be asking, and a better overall "API" 
for how date formatting/parsing rules are chosen on a per-request basis?

ie: If we're going to change the rules of date parsing/formatting, 
should we rethink *all* the rules, rather than just tweaking one detail of
rules that were written solely with communicating over XML in mind?


I mean ... i'm way out of the loop on the state of art in java.time.*, but 
even w/o knowing what all is supported/recommended there, i have to wonder
if this would be a good opportunity to add general support for a
'datetime.fmt' request param that can be multivalued to specify an ordered
set of parsers to be used any time solr needs to parse a String specified
by the client as a moment in time -- and using the first fmt specified any
time a moment in time may need to be formatted to return to the user as a string

	?
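
A rough sketch of what such an ordered set of formats could look like with
java.time follows.  Everything here is hypothetical: 'datetime.fmt' is
only a proposed param, the class and method names are invented, and the
second pattern is just an example value a client might send.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;
import java.time.format.DateTimeParseException;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

// Illustrative only: models the values of a hypothetical multivalued
// 'datetime.fmt' param as an ordered list of formatters.
final class OrderedDateFormatsSketch {

    private final List<DateTimeFormatter> formats;

    OrderedDateFormatsSketch(List<DateTimeFormatter> formats) {
        if (formats.isEmpty()) {
            throw new IllegalArgumentException("need at least one format");
        }
        this.formats = formats;
    }

    // Tries each format in order; rethrows the last failure if none match.
    Instant parse(String text) {
        DateTimeParseException last = null;
        for (DateTimeFormatter f : formats) {
            try {
                return f.parse(text, Instant::from);
            } catch (DateTimeParseException e) {
                last = e;
            }
        }
        throw last;
    }

    // Always formats with the first format in the list.
    String format(Instant instant) {
        return formats.get(0).format(instant);
    }

    // Example configuration: ISO_INSTANT first, then a space-separated
    // pattern interpreted as UTC.
    static OrderedDateFormatsSketch example() {
        return new OrderedDateFormatsSketch(Arrays.asList(
                DateTimeFormatter.ISO_INSTANT,
                DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss", Locale.ROOT)
                        .withZone(ZoneOffset.UTC)));
    }

    public static void main(String[] args) {
        OrderedDateFormatsSketch fmts = example();
        Instant parsed = fmts.parse("2016-03-22 03:32:40");
        System.out.println(fmts.format(parsed));  // 2016-03-22T03:32:40Z
    }
}
```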



-Hoss
http://www.lucidworks.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org