Posted to dev@drill.apache.org by paul-rogers <gi...@git.apache.org> on 2017/09/13 18:24:20 UTC

[GitHub] drill issue #937: DRILL-5002: Using hive's date functions on top of date col...

Github user paul-rogers commented on the issue:

    https://github.com/apache/drill/pull/937
  
    The original description talks about data with local times. The TPC-H data has no TZ. Now, maybe we made one up in creating the Parquet files, but the original data just has dates without a tz.
    
    The fundamental issue is that if we have a tz-less date, 1994-08-12, say, then this *cannot* be converted to a UTC timestamp. Which of the 23+ time zones would we use? How would the client and server agree on the arbitrary tz? This is like saying that I have a measurement in miles, but we can store distances only in km, so I'll take my length of 5 miles and store it as 5 km, remembering that I'm using km as an alias for miles. Does not make sense.
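    To illustrate the problem (a sketch in Python rather than Drill's Java internals; the UTC-5 offset is an arbitrary example, not anything Drill actually picks): encoding a tz-less date as midnight UTC means any client that converts back to its own zone can see a different calendar day.
    
    ```python
    from datetime import datetime, timezone, timedelta
    
    # A tz-less date from the data: just a calendar day, not an instant in time.
    d = "1994-08-12"
    
    # Arbitrarily encode it as midnight UTC...
    fake_utc = datetime.strptime(d, "%Y-%m-%d").replace(tzinfo=timezone.utc)
    
    # ...then a client in, say, UTC-5 converts that instant back to local time
    # and the calendar day has changed.
    local = fake_utc.astimezone(timezone(timedelta(hours=-5)))
    
    print(fake_utc.date())  # 1994-08-12
    print(local.date())     # 1994-08-11 -- the date has shifted
    ```
    
    The round trip is lossy precisely because the original value never identified an instant; any instant we pick is an invention.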
    
    Your example uses `timestamp` constants. A timestamp is defined with a timezone, and so it fits Drill's model well. But, TPC dates don't have a timezone. See [the TPC-H spec](http://www.tpc.org/tpc_documents_current_versions/pdf/tpc-h_v2.17.2.pdf) which says:
    
    > Date is a value whose external representation can be expressed as YYYY-MM-DD, where all characters are numeric. A date must be able to express any day within at least 14 consecutive years. There is no requirement specific to the internal representation of a date.
    
    That is, TPC-H dates are not midnight on some date in some timezone, they are just dates. They cannot be converted to UTC. And so, they should not be subject to shifting when time zone rules change.
    
    My point here is that Hive (according to the docs) implements functions correctly: using tz-less dates. Drill tries to convert to a (fake) UTC and use time-based functions on that data. This is, at best, a hack, and at worst, leads to great complexity and incorrect results.
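    For contrast, a sketch of what "tz-less date functions" look like (again plain Python, not Hive's or Drill's actual implementation): every operation is pure calendar arithmetic and no time zone ever enters the picture.
    
    ```python
    from datetime import date, timedelta
    
    # A tz-less date supports date functions with no zone involved at all.
    d = date(1994, 8, 12)
    
    print(d.year, d.month, d.day)   # extract_year / extract_month / extract_day
    print(d + timedelta(days=30))   # date arithmetic: 1994-09-11
    print(d.isoweekday())           # day of week: 5 (Friday)
    ```
    
    Nothing here can produce an off-by-one-day result, because there is no instant-to-calendar conversion anywhere in the pipeline.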
    
    That said, if all we have is km, and we can't do the miles-to-km conversion correctly, then we do need a way to know that a particular km value is actually miles. Similarly, using the current implementation, how will we know that a particular arbitrary-local-time-encoded-as-fake-UTC value really is local time vs. being an actual UTC time?
    
    All that said, if your fix makes the current implementation work better, then it is a good improvement.
    
    In the interests of moving ahead, let's table the basic discussion and just look at this one fix.


---