You are viewing a plain text version of this content. The canonical link for it is here.
Posted to issues@hive.apache.org by "David Mollitor (Jira)" <ji...@apache.org> on 2021/03/03 19:54:00 UTC

[jira] [Commented] (HIVE-24814) Harmonize Hive Date-Time Formats

    [ https://issues.apache.org/jira/browse/HIVE-24814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17294781#comment-17294781 ] 

David Mollitor commented on HIVE-24814:
---------------------------------------

I wanted to document a few of the things I'm struggling with in doing this:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Types

{quote}
Dates

DATE values describe a particular year/month/day, in the form YYYY-­MM-­DD. For example, DATE '2013-­01-­01'. Date types do not have a time of day component. The range of values supported for the Date type is 0000-­01-­01 to 9999-­12-­31, dependent on support by the primitive Java Date type.


=====

Timestamps

Supports traditional UNIX timestamp with optional nanosecond precision.

Supported conversions:

* Integer numeric types: Interpreted as UNIX timestamp in seconds
* Floating point numeric types: Interpreted as UNIX timestamp in seconds with decimal precision
* Strings: JDBC compliant java.sql.Timestamp format "YYYY-MM-DD HH:MM:SS.fffffffff" (9 decimal place precision)
Timestamps are interpreted to be timezoneless and stored as an offset from the UNIX epoch. Convenience UDFs for conversion to and from timezones are provided (to_utc_timestamp, from_utc_timestamp).

All existing datetime UDFs (month, day, year, hour, etc.) work with the TIMESTAMP data type.

* Timestamps in text files have to use the format yyyy-mm-dd hh:mm:ss[.f...]. If they are in another format, declare them as the appropriate type (INT, FLOAT, STRING, etc.) and use a UDF to convert them to timestamps.

* On the table level, alternative timestamp formats can be supported by providing the format to the SerDe property "timestamp.formats" (as of release 1.2.0 with HIVE-9298). For example, yyyy-MM-dd'T'HH:mm:ss.SSS,yyyy-MM-dd'T'HH:mm:ss.
{quote}

These are the definition of the data types, but I've found places in the code (and codified with unit tests):

* Some places in the code implicitly accepting ISO-8601 format: (yyyy-MM-dd'T'HH:mm:ss.SSS,yyyy-MM-dd'T'HH:mm:ss)
* Some places in the code allowing for more than _9 decimal place precision_ for the nano field. The values are silently truncated down to nine digits
* Accepting leap-year values 2021-02-29 (even though Feb 2021 is not a leap year and did not have 29 days)
* Accepting more generously formatted dates/times (2021-1-2 = 2021-01-02)

I'm working to accommodate the leniency of input formatting (such as single-digit values), but allowing for ISO-8601 without documentation, silent truncation, invalid dates, and all the rest are hard to want to live with.  It adds way too much complexity to the code and way too many test conditions.  As part of this work, I'm trying to tighten things to be in line with the stated documentation, though I know someone will scream about this because it may break previously allowed craziness.

> Harmonize Hive Date-Time Formats
> --------------------------------
>
>                 Key: HIVE-24814
>                 URL: https://issues.apache.org/jira/browse/HIVE-24814
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: David Mollitor
>            Assignee: David Mollitor
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> Harmonize Hive on JDK date-time formats courtesy of {{DateTimeFormatter}}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)