You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@orc.apache.org by wgtmac <gi...@git.apache.org> on 2018/03/16 17:25:53 UTC

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

GitHub user wgtmac opened a pull request:

    https://github.com/apache/orc/pull/233

    ORC-322: [C++] Fix writing & reading timestamp

    Currently C++ and Java version have different behaviors in reading and writing values of timestamp type. This patch ensures C++ reader/printer will behave the same as their parities on the java side and also ensure C++ reader and writer obtain same timestamp values in the TimestampVectorBatch.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wgtmac/orc ORC-322

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #233
    
----
commit b8afd28264cd63cd23e5ca2cca057e45d263e694
Author: Gang Wu <ga...@...>
Date:   2018-03-16T17:17:02Z

    ORC-322: [C++] Fix writing & reading timestamp
    
    Currently C++ and Java version have different behaviors in reading and writing values of timestamp type. This patch ensures C++ reader/printer will behave the same as their parities on the java side and also ensure C++ reader and writer obtain same timestamp values in the TimestampVectorBatch.

----


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by stiga-huang <gi...@git.apache.org>.
Github user stiga-huang commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175256539
  
    --- Diff: c++/src/ColumnWriter.cc ---
    @@ -1194,12 +1194,11 @@ namespace orc {
         bool hasNull = false;
         for (uint64_t i = 0; i < numValues; ++i) {
           if (notNull == nullptr || notNull[i]) {
    -        // TimestampVectorBatch already stores data in UTC
    -        int64_t millsUTC = secs[i] * 1000 + nanos[i] / 1000000;
    +        int64_t millsUTC =
    +          (secs[i] + timezone.getVariant(secs[i]).gmtOffset) * 1000 + nanos[i] / 1000000;
    --- End diff --
    
    I think the original data in TimestampVectorBatch should still be UTC timestamps. So we don't need the gmtOffset adjustment here.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on the issue:

    https://github.com/apache/orc/pull/233
  
    Can we extend this by adding a `WriterOption` to provide a timezone name (default to "GMT")?


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175257368
  
    --- Diff: c++/src/ColumnWriter.cc ---
    @@ -1194,12 +1194,11 @@ namespace orc {
         bool hasNull = false;
         for (uint64_t i = 0; i < numValues; ++i) {
           if (notNull == nullptr || notNull[i]) {
    -        // TimestampVectorBatch already stores data in UTC
    -        int64_t millsUTC = secs[i] * 1000 + nanos[i] / 1000000;
    +        int64_t millsUTC =
    +          (secs[i] + timezone.getVariant(secs[i]).gmtOffset) * 1000 + nanos[i] / 1000000;
    --- End diff --
    
    You can check TimestampTreeWriter.java:113. Without this line, we will see inconsistency with java code.


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r176910542
  
    --- Diff: c++/src/ColumnWriter.cc ---
    @@ -1194,9 +1194,8 @@ namespace orc {
         bool hasNull = false;
         for (uint64_t i = 0; i < numValues; ++i) {
           if (notNull == nullptr || notNull[i]) {
    -        // TimestampVectorBatch stores data in local timezone
    -        int64_t millsUTC =
    -          timezone.convertToUTC(secs[i]) * 1000 + nanos[i] / 1000000;
    +        // TimestampVectorBatch stores data in UTC timezone
    --- End diff --
    
    Done


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    Yes I agree with what you've said. The only problem of using UTC in the C++ ColumnVectorBatch is that there is an issue for C++ writer to convert timestamp in the DST back to local timezone. This will cause data correctness issue.
    
    BTW, since you've mentioned ORC 2.0, what's the plan for decimal encoding? And what's the timeline for its development and release? Thanks!
    @omalley 


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @majetideepak Can you take a look again? This patch enforces UTC timestamps are used everywhere in C++ and does not affect current C++ reader. Thanks!


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    I have finally changed this patch to always use local timezone in TimestampVectorBatch for both C++ writer and reader which is consistent with java code. Please take a look. Thanks! @omalley @majetideepak @stiga-huang @xndai 


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on the issue:

    https://github.com/apache/orc/pull/233
  
    Sorry, I am confused after reading the discussions above. The key question I have is - do we implement ORC TIMESTAMP as SQL "TIMESTAMP with Timezone" or "TIMESTAMP without Timezone"? It seems to me that we implement it as the later one. That's why we went for a rather complicated design that involved local epoch and logics to handle DST while moving between time zones. But @majetideepak comment above stated that we wanted to implement TIMESTAMP as TIMESTAMP with Tz with ORC-10. This contradicts what I saw in Java reader implementation in which timestamp value is adjusted per reader time zone (TreeReaderFactory.java line 987 to 992). 
    
    So if my understanding is correct, which means TIMESTAMP should be implemented as TIMESTAMP w/o Tz, then the current C++ reader has a bug that it always adjusts to gmt rather than the reader timezone (ColumnReader.cc line 339, 340). 
    
    @omalley is probably the best person to answer this question...


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    Thanks @majetideepak for comment!
    
    On the Java side, the input timestamp in writer TimestampColumnVector is in UTC. It leverages java.sql.Timestamp which knows the local timezone info so that it can PRINT in local timezone. You can print millis variable in line 109 in TimestampTreeWriter.java to verify this. The name of SerializationUtils.convertToUtc(localTimezone, millis) in line 113 is kind of confusing, because the result is not the current timestamp in UTC but adds an offset to local timezone which I think it is also a problem.
    
    ORC-10 has fixed the bug without writer timezone. The original design is to be resilient to move between different reader timezones. However this caused an issue in C++ between different daylight saving timezones and writer timezone is forced to be written. ORC-10 adds GMT offset is actually converting the value to local timezone so that ColumnPrinter can print the same time in local timezone. This causes a new problem that C++ reader gets timestamp value in local timezone, not UTC and it is different from java reader. I believe this is why @owen has created [ORC-37](https://issues.apache.org/jira/browse/ORC-37). SQL type TimestampTz is a new type other than traditional SQL type Timestamp, I don't think it is a good idea to mix ORC timestamp type with TimestampTz and there is another open issue for it: [ORC-189](https://issues.apache.org/jira/browse/ORC-189)
    
    It is very confusing that an input timestamp written using Java writer is read differently via C++ reader. I think we need to fix it and this can also resolve ORC-37. What do you think?


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on the issue:

    https://github.com/apache/orc/pull/233
  
    The `utc` value from`long utc = SerializationUtils.convertToUtc(localTimezone, millis);` is only used to update the statistics and not the values. The epoch is also w.r.t to local time and not UTC.
    This is an API breaking change and I would wait for @omalley to comment on this before checking this in. We should definitely document in detail the behavior of Timestamp values.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @wgtmac and @stiga-huang you are right that the C++ and Java writers must write the same value to a file for a given input timestamp value. Looks like the Java side writes the timestamp values provided as is in local time (no conversion) and writes the writer timezone in the footer (however, stats are in UTC). We must do the same for the C++ writer as well if not already.
    
    ORC-10 adds GMT offset when reading the values back. Therefore, the C++ reader always returns values in UTC. The current behavior of ORC reader for timestamp values is the same as SQL `TimestampTz`.
    To get the same values back ( aka SQL `Timestamp`), you need to convert the values read back to local time.
    
    If you read a timestamp column from an ORC file and plan to write it immediately, you must first convert the values to the local time before writing.


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by stiga-huang <gi...@git.apache.org>.
Github user stiga-huang commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175256583
  
    --- Diff: c++/src/ColumnWriter.cc ---
    @@ -1194,12 +1194,11 @@ namespace orc {
         bool hasNull = false;
         for (uint64_t i = 0; i < numValues; ++i) {
           if (notNull == nullptr || notNull[i]) {
    -        // TimestampVectorBatch already stores data in UTC
    -        int64_t millsUTC = secs[i] * 1000 + nanos[i] / 1000000;
    +        int64_t millsUTC =
    +          (secs[i] + timezone.getVariant(secs[i]).gmtOffset) * 1000 + nanos[i] / 1000000;
             tsStats->increase(1);
             tsStats->update(millsUTC);
     
    -        secs[i] -= timezone.getVariant(secs[i]).gmtOffset;
             secs[i] -= timezone.getEpoch();
    --- End diff --
    
    It'd be more elegant to treat the input data as read-only, so ColumnWriter won't has side-effects on caller's data.


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
GitHub user wgtmac reopened a pull request:

    https://github.com/apache/orc/pull/233

    ORC-322: [C++] Fix writing & reading timestamp

    Currently C++ and Java version have different behaviors in reading and writing values of timestamp type. This patch ensures C++ reader/printer will behave the same as their parities on the java side and also ensure C++ reader and writer obtain same timestamp values in the TimestampVectorBatch.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/wgtmac/orc ORC-322

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/orc/pull/233.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #233
    
----
commit b8afd28264cd63cd23e5ca2cca057e45d263e694
Author: Gang Wu <ga...@...>
Date:   2018-03-16T17:17:02Z

    ORC-322: [C++] Fix writing & reading timestamp
    
    Currently C++ and Java version have different behaviors in reading and writing values of timestamp type. This patch ensures C++ reader/printer will behave the same as their parities on the java side and also ensure C++ reader and writer obtain same timestamp values in the TimestampVectorBatch.

commit 828b10af844a22fd49fa4ce4095d0cec96d8eee5
Author: Gang Wu <ga...@...>
Date:   2018-03-17T03:03:36Z

    Fix include path and tests

----


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175257109
  
    --- Diff: c++/src/ColumnPrinter.cc ---
    @@ -710,7 +713,10 @@ namespace orc {
           writeString(buffer, "null");
         } else {
           int64_t nanos = nanoseconds[rowId];
    -      time_t secs = static_cast<time_t>(seconds[rowId]);
    +      // adjust the second to gmt timezone so that we can print same value
    +      // in writer's timezone
    +      time_t secs = static_cast<time_t>(seconds[rowId] +
    +        writerTimezone->getVariant(seconds[rowId]).gmtOffset);
           struct tm tmValue;
           gmtime_r(&secs, &tmValue);
           char timeBuffer[20];
    --- End diff --
    
    Good idea! But I don't think we have simple timezone acronyms on hand so I prefer to ignore it for now.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by stiga-huang <gi...@git.apache.org>.
Github user stiga-huang commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @omalley @majetideepak @wgtmac Thanks for your follow up on ORC-322! If I understand these correctly, the convention is that TimestampColumnVector should only accept timestamps in local time. Timestamp values stored in ORC file are `local_timestamp - local_orc_epoch`. TimestampColumnVector got from the java reader has timestamps in local time. However, TimestampColumnVector got from the c++ reader has UTC timestamps.
    
    If so, the c++ writer doesn't need to minus gmtOffset for each timestamp, because after shifting the values in ORC file are `utc_timestamp - local_orc_epoch`.
    
    If not, I think the bug in ORC-320 should still be fixed (ORC-322 is aimed to fix ORC-320). The root cause of ORC-320 is that gmtOffsets got in writer and reader can be different, though they're using the same Timezone.
    
    To be specific, the writer gets gmtOffset by timestamp `ts`, then writes down `ts - gmtOffset` (Let's ignore the orc epoch since it's the same in writer and reader). The reader use `ts - gmtOffset` to get gmtOffset2, then read out `ts - gmtOffset + gmtOffset2`. However, `gmtOffset2` may not equal to `gmtOffset`.
    
    Thanks for your patience reading this long comment!


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by omalley <gi...@git.apache.org>.
Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/233
  
    I haven't looked at the details of this patch. I do note that I had messed up the Java side for timestamps before 1970 with nanos in the range of 1 to 999,999. *sigh* See ORC-306.


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175290353
  
    --- Diff: c++/src/ColumnWriter.cc ---
    @@ -1194,12 +1194,11 @@ namespace orc {
         bool hasNull = false;
         for (uint64_t i = 0; i < numValues; ++i) {
           if (notNull == nullptr || notNull[i]) {
    -        // TimestampVectorBatch already stores data in UTC
    -        int64_t millsUTC = secs[i] * 1000 + nanos[i] / 1000000;
    +        int64_t millsUTC =
    +          (secs[i] + timezone.getVariant(secs[i]).gmtOffset) * 1000 + nanos[i] / 1000000;
             tsStats->increase(1);
             tsStats->update(millsUTC);
     
    -        secs[i] -= timezone.getVariant(secs[i]).gmtOffset;
             secs[i] -= timezone.getEpoch();
    --- End diff --
    
    This was a trade-off. If we have made the input data constant, then we need to allocate another buffer to store those modified values which may introduce larger memory consumption.


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac closed the pull request at:

    https://github.com/apache/orc/pull/233


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @omalley @majetideepak @stiga-huang After second thought, I have changed this PR not to break backward compatibility and now TimestampVectorBatch has the same meaning in writer and reader. Please take a look when you have time. Thanks!


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by stiga-huang <gi...@git.apache.org>.
Github user stiga-huang commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175257358
  
    --- Diff: c++/src/ColumnPrinter.cc ---
    @@ -710,7 +713,10 @@ namespace orc {
           writeString(buffer, "null");
         } else {
           int64_t nanos = nanoseconds[rowId];
    -      time_t secs = static_cast<time_t>(seconds[rowId]);
    +      // adjust the second to gmt timezone so that we can print same value
    +      // in writer's timezone
    +      time_t secs = static_cast<time_t>(seconds[rowId] +
    +        writerTimezone->getVariant(seconds[rowId]).gmtOffset);
           struct tm tmValue;
           gmtime_r(&secs, &tmValue);
           char timeBuffer[20];
    --- End diff --
    
    I think we can simply use `TimezoneVariant::name` as the acronyms.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @majetideepak Done! I didn't add setWriterTimezone method because we don't expect users to change the writer timezone.
    Please also take a look at this minor fix: https://github.com/apache/orc/pull/238
    Thanks!


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r176909659
  
    --- Diff: c++/src/ColumnWriter.cc ---
    @@ -1194,9 +1194,8 @@ namespace orc {
         bool hasNull = false;
         for (uint64_t i = 0; i < numValues; ++i) {
           if (notNull == nullptr || notNull[i]) {
    -        // TimestampVectorBatch stores data in local timezone
    -        int64_t millsUTC =
    -          timezone.convertToUTC(secs[i]) * 1000 + nanos[i] / 1000000;
    +        // TimestampVectorBatch stores data in UTC timezone
    --- End diff --
    
    Can you add a comment for TimestampVectorBatch regarding this requirement?


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175257375
  
    --- Diff: c++/src/Vector.cc ---
    @@ -436,4 +438,8 @@ namespace orc {
               + static_cast<uint64_t>(
                   (data.capacity() + nanoseconds.capacity()) * sizeof(int64_t));
       }
    +
    +  int64_t TimestampVectorBatch::getSecondInWriterTZ(uint64_t rowId) {
    +    return data[rowId] + writerTimezone->getVariant(data[rowId]).gmtOffset;
    --- End diff --
    
    Good suggestion!


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by stiga-huang <gi...@git.apache.org>.
Github user stiga-huang commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175256428
  
    --- Diff: c++/src/ColumnPrinter.cc ---
    @@ -710,7 +713,10 @@ namespace orc {
           writeString(buffer, "null");
         } else {
           int64_t nanos = nanoseconds[rowId];
    -      time_t secs = static_cast<time_t>(seconds[rowId]);
    +      // adjust the second to gmt timezone so that we can print same value
    +      // in writer's timezone
    +      time_t secs = static_cast<time_t>(seconds[rowId] +
    +        writerTimezone->getVariant(seconds[rowId]).gmtOffset);
           struct tm tmValue;
           gmtime_r(&secs, &tmValue);
           char timeBuffer[20];
    --- End diff --
    
    Can we print the writer timezone as well to make this string clear?


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @stiga-huang Yes, you're right. Java TimestampColumnVector always use local timezone. But C++ reader uses UTC in TimestampVectorBatch. I get your point for different gmtOffsets. Let me try to reproduce it. 


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by stiga-huang <gi...@git.apache.org>.
Github user stiga-huang commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r175256695
  
    --- Diff: c++/src/Vector.cc ---
    @@ -436,4 +438,8 @@ namespace orc {
               + static_cast<uint64_t>(
                   (data.capacity() + nanoseconds.capacity()) * sizeof(int64_t));
       }
    +
    +  int64_t TimestampVectorBatch::getSecondInWriterTZ(uint64_t rowId) {
    +    return data[rowId] + writerTimezone->getVariant(data[rowId]).gmtOffset;
    --- End diff --
    
    Can we wrap the logic `ts + timezone->getVariant(ts).gmtOffset` as a function (e.g. `Timezone::getLocalSecs(long utcSecs)`) in Timezone? Looks like we frequently use it.


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on a diff in the pull request:

    https://github.com/apache/orc/pull/233#discussion_r176910052
  
    --- Diff: c++/src/ColumnReader.cc ---
    @@ -336,8 +336,7 @@ namespace orc {
               }
             }
             int64_t writerTime = secsBuffer[i] + epochOffset;
    -        secsBuffer[i] = writerTime +
    -          writerTimezone.getVariant(writerTime).gmtOffset;
    +        secsBuffer[i] = writerTimezone.convertToUTC(writerTime);
    --- End diff --
    
    If it reads a file created by Java writer with writer timezone other than GMT, you are not going to get the same clock reading from ColumnPrinter, are you?


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    Thanks @majetideepak !


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @omalley @majetideepak can you please take a look at this patch? @stiga-huang and I think there's an issue with the C++ reader and writer which may have correctness issue. 
    
    @majetideepak can you also verify whether Vertica and Arrow have issues with timestamp values using C++ reader? Thanks!


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on the issue:

    https://github.com/apache/orc/pull/233
  
    I also think that we should adjust the value being read from the C++ reader w.r.t to the reader and writer timezones (if they are different) like the Java reader implementation. The current C++ behavior is definitely inconsistent with Java (which does things the right way).


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by majetideepak <gi...@git.apache.org>.
Github user majetideepak commented on the issue:

    https://github.com/apache/orc/pull/233
  
    You are right about the last commit. It is redundant! I will remove that commit and merge this. Thanks!


---

[GitHub] orc pull request #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by asfgit <gi...@git.apache.org>.
Github user asfgit closed the pull request at:

    https://github.com/apache/orc/pull/233


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by omalley <gi...@git.apache.org>.
Github user omalley commented on the issue:

    https://github.com/apache/orc/pull/233
  
    So the semantics of the current timestamp type are that it does not change with the reader's timezone. It should always come back the same. If you put '2018-03-23 12:34:56.789' in, it should always come back exactly the same.
    
    The easiest way to get that effect is to use UTC in the ColumnVectorBatch, so that is what we do. This is different from the Java side because there we needed to use Java's Timestamp class,  which always uses local.
    
    We are going to add a new type that is "timestamp with local timezone" in ORC-189. That is for the instance in time where timezone is taken in to account.
    
    The encoding of timestamps in ORC 1.x is overly complicated because I messed it up and we've tried to maintain backwards and forwards compatibility. In ORC 2 we'll just use UTC for encoding both types.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on the issue:

    https://github.com/apache/orc/pull/233
  
    IMO the ideal approach of achieving the timestamp semantics is to have the caller pass in session timezone, and the reader return timestamp batch based on that. That would be similar to what we do with Java reader today except that Java reader assumes current thread timezone as session timezone, which in some case can be wrong. The other option is we return timestamp in a fixed timezone (such as GMT as C++ reader does today) and the behavior needs to be documented as part of the interface. The caller can always convert it to session timezone accordingly. But currently the Java and C++ reader behaves differently which can be misleading.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by stiga-huang <gi...@git.apache.org>.
Github user stiga-huang commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @wgtmac To reproduce ORC-320, you can try timestamps in the description under America/Los_Angeles timezone.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by xndai <gi...@git.apache.org>.
Github user xndai commented on the issue:

    https://github.com/apache/orc/pull/233
  
    LGTM. @wgtmac 's fix will work, but I still think we need to keep same behavior for C++ and Java in long run.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by wgtmac <gi...@git.apache.org>.
Github user wgtmac commented on the issue:

    https://github.com/apache/orc/pull/233
  
    @stiga-huang  Yes I can reproduce ORC-320. How about the latest patch? This also fixes ORC-320.


---

[GitHub] orc issue #233: ORC-322: [C++] Fix writing & reading timestamp

Posted by stiga-huang <gi...@git.apache.org>.
Github user stiga-huang commented on the issue:

    https://github.com/apache/orc/pull/233
  
    LGTM. We just need to document that users should pass in UTC timestamps when using the c++ writer.
    Thanks, @wgtmac!


---