Posted to github@arrow.apache.org by "wgtmac (via GitHub)" <gi...@apache.org> on 2023/03/20 06:02:48 UTC

[GitHub] [arrow] wgtmac commented on a diff in pull request #34591: GH-34590: [C++][ORC] Fix timestamp type mapping between orc and arrow

wgtmac commented on code in PR #34591:
URL: https://github.com/apache/arrow/pull/34591#discussion_r1141659988


##########
cpp/src/arrow/adapters/orc/util.cc:
##########
@@ -1111,7 +1120,11 @@ Result<std::shared_ptr<DataType>> GetArrowType(const liborc::Type* type) {
     case liborc::CHAR:
       return fixed_size_binary(static_cast<int>(type->getMaximumLength()));
     case liborc::TIMESTAMP:
+      // The timestamp values stored in ORC are in the writer timezone.
       return timestamp(TimeUnit::NANO);

Review Comment:
   Yes, you're right.
   
   For `orc::TIMESTAMP` type:
   - The ORC writer expects input data (i.e. the values in `orc::TimestampVectorBatch`) to be in the UTC timezone and serializes it into the writer timezone: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnWriter.cc#L1717
   - The ORC reader deserializes the data from the writer timezone and restores it into the reader timezone: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L336
   
   For `orc::TIMESTAMP_INSTANT` type:
   - The ORC writer expects input data to be in the UTC timezone and serializes it as UTC: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnWriter.cc#L1644
   - The ORC reader deserializes the data as UTC, and no further conversion is needed because writerTimezone and readerTimezone are both UTC: https://github.com/apache/orc/blob/main/c%2B%2B/src/ColumnReader.cc#L282
   
   We have seen many issues with the `orc::TIMESTAMP` type caused by the writer-reader timezone conversion, especially with different daylight saving rules. That is why the `orc::TIMESTAMP_INSTANT` type was added; it is always preferred over `orc::TIMESTAMP` if the user can take care of the timezone themselves.
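
To make the difference concrete, here is a minimal toy sketch of the reader-side behavior described above. It is not ORC's actual code: `FixedZone`, `RestoreTimestamp`, and `RestoreTimestampInstant` are hypothetical names, timezones are reduced to fixed UTC offsets (no daylight saving rules), and the sign of the shift is an assumption chosen for illustration.

```cpp
#include <cstdint>

// Hypothetical stand-in for a timezone: a fixed offset in seconds east of UTC.
// Real timezones have daylight saving transitions, which this toy ignores.
struct FixedZone {
  int64_t utc_offset_seconds;
};

constexpr FixedZone kUtc{0};

// Toy model of the reader-side step for orc::TIMESTAMP: the stored value is
// relative to the writer timezone and is shifted into the reader timezone.
// The sign convention here is an assumption made for illustration.
constexpr int64_t RestoreTimestamp(int64_t stored_seconds, FixedZone writer,
                                   FixedZone reader) {
  return stored_seconds +
         (writer.utc_offset_seconds - reader.utc_offset_seconds);
}

// Toy model for orc::TIMESTAMP_INSTANT: writer and reader are both UTC,
// so the shift is always zero and the value comes back unchanged.
constexpr int64_t RestoreTimestampInstant(int64_t stored_seconds) {
  return RestoreTimestamp(stored_seconds, kUtc, kUtc);
}
```

In this model, a file written at +08:00 and read at -08:00 has every `orc::TIMESTAMP` value moved by 16 hours, while the `orc::TIMESTAMP_INSTANT` path never depends on either side's timezone. Once real daylight saving rules replace the fixed offsets, the shift can differ from value to value, which is where the issues mentioned above tend to appear.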


