Posted to issues@drill.apache.org by "Paul Rogers (Jira)" <ji...@apache.org> on 2022/01/01 23:52:00 UTC

[jira] [Updated] (DRILL-8099) Parquet does not convert from Drill local timestamp to UTC

     [ https://issues.apache.org/jira/browse/DRILL-8099?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Paul Rogers updated DRILL-8099:
-------------------------------
    Summary: Parquet does not convert from Drill local timestamp to UTC  (was: Parquet record writer does not convert Dril local timestamp to UTC)

> Parquet does not convert from Drill local timestamp to UTC
> ----------------------------------------------------------
>
>                 Key: DRILL-8099
>                 URL: https://issues.apache.org/jira/browse/DRILL-8099
>             Project: Apache Drill
>          Issue Type: Bug
>    Affects Versions: 1.19.0
>            Reporter: Paul Rogers
>            Priority: Major
>
> Drill follows the old SQL engine convention of storing the {{TIMESTAMP}} type in the local time zone. This is, of course, highly awkward today, when UTC is the standard timestamp representation in most products. However, it is how Drill works. (It would be great to add a {{UTC_TIMESTAMP}} type, but that is another topic.)
> Each reader or writer that works with files that hold UTC timestamps must convert to (reader) or from (writer) Drill's local-time timestamp. Otherwise, Drill works correctly only when the server time zone is set to UTC.
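> As a sketch of the two conversions (using plain {{java.time}} rather than Drill's internal date classes; the helper names are illustrative, not existing Drill methods):
> {code:java}
> import java.time.Instant;
> import java.time.LocalDateTime;
> import java.time.ZoneId;
> import java.time.ZoneOffset;
>
> public class DrillTimestampConversion {
>
>   // A Drill TIMESTAMP stores the local wall-clock time encoded as if it
>   // were UTC epoch millis. Readers of true-UTC data must shift by the zone
>   // offset; writers must shift back.
>   static long utcToDrillLocal(long utcMillis, ZoneId zone) {
>     return Instant.ofEpochMilli(utcMillis)
>         .atZone(zone)                // wall-clock time in the server zone
>         .toLocalDateTime()
>         .toInstant(ZoneOffset.UTC)   // re-encode that wall clock as "UTC"
>         .toEpochMilli();
>   }
>
>   static long drillLocalToUtc(long drillMillis, ZoneId zone) {
>     return LocalDateTime.ofInstant(Instant.ofEpochMilli(drillMillis), ZoneOffset.UTC)
>         .atZone(zone)                // interpret the wall clock in the server zone
>         .toInstant()
>         .toEpochMilli();
>   }
>
>   public static void main(String[] args) {
>     ZoneId pacific = ZoneId.of("America/Los_Angeles");
>     long utc = Instant.parse("2022-01-01T12:00:00Z").toEpochMilli();
>     long local = utcToDrillLocal(utc, pacific);
>     System.out.println((utc - local) / 3_600_000);               // 8 (PST is UTC-8)
>     System.out.println(drillLocalToUtc(local, pacific) == utc);  // true
>   }
> }
> {code}
> Note that the two helpers are exact inverses only when no DST transition is involved; local times that are ambiguous or skipped around a transition are a further wrinkle any real fix must address.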
> Now, perhaps we can convince most shops to run their Drill server in UTC, or at least to set the JVM time zone to UTC. However, this still leaves developers in the lurch: if the development machine's time zone is not UTC, then some tests fail. In particular:
> {{TestNestedDateTimeTimestamp.testNestedDateTimeCTASParquet}}
> The reason that the above test fails is that the generated Parquet writer code assumes (incorrectly) that the Drill timestamp is in UTC and so no conversion is needed to write that data into Parquet. In particular, in {{ParquetOutputRecordWriter.getNewTimeStampConverter()}}:
> {code:java}
>     reader.read(holder);
>     // holder.value is Drill local-time millis; it is passed to Parquet
>     // unchanged, even though the column claims isAdjustedToUTC = true
>     consumer.addLong(holder.value);
> {code}
> Basically, it takes a {{LocalDateTime}} and formats it as if it were UTC (using the "Z" suffix). This is only valid if the machine is in the UTC time zone, which is why the test for this class attempts to force the local time zone to UTC, something that most users will not do.
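> A minimal illustration of the mislabeling ({{stampZ}} is a hypothetical stand-in for what the generated code effectively does, not actual Drill code):
> {code:java}
> import java.time.Instant;
> import java.time.LocalDateTime;
> import java.time.ZoneId;
> import java.time.ZoneOffset;
>
> public class MislabeledUtc {
>
>   // Stamping "Z" onto a zone-less wall-clock value claims UTC without
>   // actually converting anything.
>   static Instant stampZ(LocalDateTime wallClock) {
>     return wallClock.toInstant(ZoneOffset.UTC);
>   }
>
>   public static void main(String[] args) {
>     Instant trueInstant = Instant.parse("2022-01-01T12:00:00Z");
>     // The Drill TIMESTAMP is effectively this wall clock: 04:00 Pacific.
>     LocalDateTime wallClock =
>         LocalDateTime.ofInstant(trueInstant, ZoneId.of("America/Los_Angeles"));
>     Instant mislabeled = stampZ(wallClock);
>     System.out.println(mislabeled);                      // 2022-01-01T04:00:00Z
>     System.out.println(trueInstant.equals(mislabeled));  // false: off by 8 hours
>   }
> }
> {code}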
> A consequence of this bug is that "round trip" CTAS will change dates by the UTC offset of the machine running the CTAS. In the Pacific time zone, each "round trip" subtracts 8 hours from the time. After three round trips, the "UTC" date in the Parquet file or JSON will be a day earlier than the original data. One might argue that this "feature" is not always helpful.
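> The cumulative drift can be modeled as follows ({{buggyRoundTrip}} is a hypothetical condensation of the write-then-read path, not actual Drill code):
> {code:java}
> import java.time.Instant;
> import java.time.ZoneId;
> import java.time.ZoneOffset;
>
> public class RoundTripDrift {
>
>   // Writer: emits the stored value unchanged while labeling it UTC.
>   // Reader: dutifully converts the "UTC" value to local wall-clock time.
>   // Net effect: each round trip subtracts the zone offset.
>   static long buggyRoundTrip(long millis, ZoneId zone) {
>     return Instant.ofEpochMilli(millis)
>         .atZone(zone)
>         .toLocalDateTime()
>         .toInstant(ZoneOffset.UTC)
>         .toEpochMilli();
>   }
>
>   public static void main(String[] args) {
>     ZoneId pacific = ZoneId.of("America/Los_Angeles");  // UTC-8 in winter
>     long t = Instant.parse("2022-01-01T12:00:00Z").toEpochMilli();
>     for (int trip = 0; trip < 3; trip++) {
>       t = buggyRoundTrip(t, pacific);
>     }
>     // Three round trips lose 24 hours: the value is now a day earlier.
>     System.out.println(Instant.ofEpochMilli(t));  // 2021-12-31T12:00:00Z
>   }
> }
> {code}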
> This issue probably relates to a fix for DRILL-4203.
> h3. Parquet Timestamp Write
> Timestamp handling is quite confusing. Let's start by establishing the form of timestamp that Drill should write to Parquet. 
> Drill uses the deprecated Parquet {{OriginalType}} to describe types, and uses {{TIMESTAMP_MILLIS}} to describe a timestamp. In {{ParquetTypeHelper}}:
> {code:java}
>             originalTypeMap.put(MinorType.TIMESTAMP, OriginalType.TIMESTAMP_MILLIS);
> {code}
> The {{OriginalType}} is converted, within Parquet, to the preferred {{LogicalTypeAnnotation}} class:
> {code:java}
>       case TIMESTAMP_MILLIS:
>         return timestampType(true, LogicalTypeAnnotation.TimeUnit.MILLIS);
> {code}
> Where the declaration is:
> {code:java}
>   public static TimestampLogicalTypeAnnotation timestampType(final boolean isAdjustedToUTC, final TimeUnit unit) {
> {code}
> This means that the data written to Parquet should be "adjusted to UTC", that is, in the UTC timezone. Thus, Drill must convert from the "Drill timestamp" in the local time zone to UTC when writing to Parquet.
> h3. Parquet Timestamp Read
> On the read side, the class {{ParquetToDrillTypeConverter}} maps Parquet types to Drill types:
> {code:java}
>           case TIMESTAMP_MILLIS:
>           case TIMESTAMP_MICROS:
>             return TypeProtos.MinorType.TIMESTAMP;
> {code}
> The {{NullableFixedByteAlignedReaders}} class reads some of these values. Here we see a bug:
> {code:java}
>   static class NullableDictionaryTimeStampReader extends NullableColumnReader<NullableTimeStampVector> {
>     @Override
>     protected void readField(long recordsToReadInThisPass) {
>       ValuesReader valReader = usingDictionary ? pageReader.getDictionaryValueReader() : pageReader.getValueReader();
>       for (int i = 0; i < recordsToReadInThisPass; i++){
>         // Bug: the raw Parquet value (UTC, per the annotation) is copied
>         // into the Drill vector with no UTC-to-local conversion
>         valueVec.getMutator().setSafe(valuesReadInCurrentPass + i, valReader.readLong());
>       }
>       advanceWriterIndex();
>     }
>   }
> {code}
> Here we see that we read the value from Parquet and write it, _without conversion_, into the Drill timestamp vector. That is, when reading, we assume that the Parquet data is in the local time zone. We completely ignore the {{isAdjustedToUTC}} attribute that we ourselves asked to be set. Hence, we write UTC, but read UTC as local time.
> The {{isAdjustedToUTC}} attribute in Parquet may have been added after the original Drill code was written. As a result, Drill assumes local time, ignoring whether the data is in UTC or not. 
> To recover the UTC attribute, we must consult the {{SchemaElement}} passed into {{NullableDictionaryTimeStampReader}}. This is a Thrift type with the required information in the {{LogicalType logicalType}} field. The {{TimestampType}} subclass provides the required information.
> h3. Metadata-based Pruning
> Parquet is made considerably more complex by the use of metadata: the planner prunes the scan before the scan operator even starts. Thus, errors in local vs. UTC handling occur in this phase as well. Modifying the writer to write UTC causes the filter pruning to fail, presumably because the pruning code assumes that the dates in Parquet are local. To see this, run {{TestCastFunctions.testCastTimestampLiteralInFilter}}. If the data written to Parquet is corrected to be UTC, then the Parquet reader is asked to read 0 records, though it should read the single record in the file.
> h3. Recommendation
> Divide the problem into three parts:
> # Determine if this is a problem. Perhaps users don't store timestamps? Or, have worked out ways to work around the incorrect behavior? Do we have the resources to tackle such a complex issue?
> # Specify the correct behavior. Perhaps we need yet another option (groan) to say whether we write Drill timestamps as local or UTC. (That is, whether we set the {{isAdjustedToUTC}} attribute.) We can always write local time (with the attribute set to {{false}}), always write UTC (with the attribute set to {{true}}), or let the user choose. What we can't do is claim to write UTC but actually write local time.
> # Determine user impact. Users that have learned to work around the current incorrect behavior may encounter problems when we switch to the correct behavior.
> Then, to implement:
> # Clearly state the rules. Which parts of Drill use local time (the Timestamp vector) and which use UTC.
> # Specify the preferred conversions to/from UTC and Drill timestamps. (Currently, every bit of code does it in its own way.)
> # Identify in detail what Parquet, and its metadata pruning feature, does today.
> # Identify changes required.
> # Implement the changes.
> # Provide a robust set of unit tests that verify the intended behavior. Resist the temptation to fudge, as the current tests do, by forcing the time zone to UTC.
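> For the last point, tests can be made zone-independent by passing the zone explicitly instead of forcing {{user.timezone}} to UTC. A sketch (the {{localToUtc}} helper is illustrative, not existing Drill code):
> {code:java}
> import java.time.Instant;
> import java.time.LocalDateTime;
> import java.time.ZoneId;
> import java.time.ZoneOffset;
>
> public class TimestampConversionTest {
>
>   // Converts Drill local-time millis to true UTC millis for a given zone.
>   static long localToUtc(long localMillis, ZoneId zone) {
>     return LocalDateTime.ofInstant(Instant.ofEpochMilli(localMillis), ZoneOffset.UTC)
>         .atZone(zone)
>         .toInstant()
>         .toEpochMilli();
>   }
>
>   public static void main(String[] args) {
>     // The assertion holds on any machine because the zone is a parameter,
>     // not the JVM default.
>     for (String z : new String[] {"UTC", "America/Los_Angeles", "Asia/Tokyo"}) {
>       ZoneId zone = ZoneId.of(z);
>       Instant expected = Instant.parse("2022-01-01T12:00:00Z");
>       long localMillis = LocalDateTime.ofInstant(expected, zone)
>           .toInstant(ZoneOffset.UTC).toEpochMilli();
>       if (localToUtc(localMillis, zone) != expected.toEpochMilli()) {
>         throw new AssertionError("round trip failed for zone " + z);
>       }
>     }
>     System.out.println("all zones ok");
>   }
> }
> {code}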



--
This message was sent by Atlassian Jira
(v8.20.1#820001)