Posted to dev@parquet.apache.org by Wes McKinney <we...@gmail.com> on 2019/07/09 15:38:51 UTC

Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

hi folks,

We have just recently implemented the new LogicalType unions in the
Parquet C++ library and we have run into a forward compatibility
problem with reader versions prior to this implementation.

To recap the issue, prior to the introduction of LogicalType, the
Parquet format had no explicit notion of time zones or UTC
normalization. The new TimestampType provides a flag to indicate
UTC normalization:

struct TimestampType {
1: required bool isAdjustedToUTC
2: required TimeUnit unit
}

When using this new type, the ConvertedType field must also be set for
forward compatibility (so that old readers can still understand the
data), but parquet.thrift says

// use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC =
true, unit = MICROS)
// use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC =
true, unit = MILLIS)
8: TimestampType TIMESTAMP
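
Spelled out in Python (a sketch; the names are mine, not from any Parquet
library), the comment above amounts to the following mapping. Note that a
timestamp with isAdjustedToUTC = false gets no ConvertedType at all, which
is where the compatibility gap comes from:

```python
from enum import Enum
from typing import Optional

class TimeUnit(Enum):
    MILLIS = "MILLIS"
    MICROS = "MICROS"
    NANOS = "NANOS"

def legacy_converted_type(is_adjusted_to_utc: bool, unit: TimeUnit) -> Optional[str]:
    # Per the parquet.thrift comment, only UTC-adjusted MILLIS/MICROS
    # timestamps receive a legacy ConvertedType annotation.
    if not is_adjusted_to_utc:
        return None  # local semantics: old readers would see plain INT64
    if unit is TimeUnit.MILLIS:
        return "TIMESTAMP_MILLIS"
    if unit is TimeUnit.MICROS:
        return "TIMESTAMP_MICROS"
    return None  # NANOS never had a legacy ConvertedType

assert legacy_converted_type(True, TimeUnit.MICROS) == "TIMESTAMP_MICROS"
assert legacy_converted_type(False, TimeUnit.MICROS) is None
```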

In Apache Arrow, we have 2 varieties of timestamps:

* Timestamp without time zone (no UTC normalization indicated)
* Timestamp with time zone (values UTC-normalized)

Prior to the introduction of LogicalType, we would set either
TIMESTAMP_MILLIS or TIMESTAMP_MICROS regardless of UTC
normalization. So when reading the data back, any notion of having had
a time zone is lost (though it could be stored in schema metadata if
desired).

I believe that setting the TIMESTAMP_* ConvertedType _only_ when
isAdjustedToUTC is true creates a forward compatibility break in this
regard. This was reported to us shortly after releasing Apache Arrow
0.14.0:

https://issues.apache.org/jira/browse/ARROW-5878

We are discussing setting the ConvertedType unconditionally in

https://github.com/apache/arrow/pull/4825

This might need to be a setting that is toggled when data is coming
from Arrow, but I wonder whether the text in parquet.thrift reflects
the intended forward compatibility interpretation, and, if not,
whether we should amend it.

Thanks,
Wes

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Wes,

Sounds good to me.

Br,

Zoltan

On Thu, Jul 11, 2019 at 4:14 PM Wes McKinney <we...@gmail.com> wrote:
>
> On Thu, Jul 11, 2019 at 8:17 AM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
> >
> > Hi Wes,
> >
> > I did a little bit of testing using pyarrow 0.14.0. I know that this
> > is not the latest state of the code, but my understanding is that the
> > only planned change is that 0.14.1 will add the legacy types even for
> > local semantics. Here is what I did:
> >
> >     In [1]: import pandas as pd
> >        ...: from datetime import datetime
> >        ...: from pandas import Timestamp
> >        ...: from pytz import timezone
> >        ...: ts_dt = datetime(1970, 1, 1)
> >        ...: ts_pd = Timestamp(year=1970, month=1, day=1)
> >        ...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1, tz=timezone("Europe/Paris"))
> >        ...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1, hour=2, tz=timezone("Europe/Helsinki"))
> >        ...: df = pd.DataFrame({
> >        ...:     'datetime': [ts_dt, None],
> >        ...:     'pd_no_tz': [ts_pd, None],
> >        ...:     'pd_paris': [ts_pd_paris, None],
> >        ...:     'pd_helsinki': [ts_pd_helsinki, None],
> >        ...:     'pd_mixed': [ts_pd_paris, ts_pd_helsinki]
> >        ...: })
> >        ...: df.to_parquet('test.parquet')
> >        ...: df
> >     Out[1]:
> >         datetime   pd_no_tz                  pd_paris               pd_helsinki                  pd_mixed
> >     0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 02:00:00+02:00 1970-01-01 01:00:00+01:00
> >     1        NaT        NaT                       NaT                       NaT 1970-01-01 02:00:00+02:00
> >
> > I picked these values because I expected all of these timestamps to be
> > saved as the numeric value 0. Checking the metadata using
> > parquet-tools:
> >
> >     > parquet-tools meta test.parquet
> >     datetime:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> >     pd_no_tz:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> >     pd_paris:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> >     pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
> >     pd_mixed:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
> >
> > This matched my expectations up until pd_mixed. I was surprised to see
> > that timestamps with mixed time zones were stored using local
> > semantics instead of being normalized to UTC, but if it's an
> > intentional choice, I'm fine with it as long as the numbers are
> > correct:
> >
> >     > parquet-tools head test.parquet
> >     datetime = 0
> >     pd_no_tz = 0
> >     pd_paris = 0
> >     pd_helsinki = 0
> >     pd_mixed = 3600000
> >     pd_mixed = 7200000
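
These two values follow directly from wall-clock arithmetic: under local
semantics the stored number is the wall-clock time since the epoch, so the
UTC offsets (+01:00 and +02:00) leak into it. A stdlib sketch:

```python
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def wall_clock_millis(naive_dt: datetime) -> int:
    # What a TIMESTAMP(MILLIS,false) column stores: milliseconds between
    # the wall-clock reading and the epoch's wall clock, zone discarded.
    return int((naive_dt - EPOCH).total_seconds() * 1000)

# The pd_mixed rows with tzinfo stripped (both denote 1970-01-01T00:00:00Z):
assert wall_clock_millis(datetime(1970, 1, 1, 1)) == 3_600_000  # Paris, +01:00
assert wall_clock_millis(datetime(1970, 1, 1, 2)) == 7_200_000  # Helsinki, +02:00
```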
> >
> > The numbers for the mixed column are not 0, but that is just the
> > result of not using UTC-normalized semantics there. The values can be
> > correctly interpreted by parquet-tools:
> >
> >     > parquet-tools dump test.parquet
> >     INT64 datetime
> >     --------------------------------------------------------------------------------
> >     *** row group 1 of 1, values 1 to 2 ***
> >     value 1: R:0 D:1 V:1970-01-01T00:00:00.000
> >     value 2: R:0 D:0 V:<null>
> >
> >     INT64 pd_no_tz
> >     --------------------------------------------------------------------------------
> >     *** row group 1 of 1, values 1 to 2 ***
> >     value 1: R:0 D:1 V:1970-01-01T00:00:00.000
> >     value 2: R:0 D:0 V:<null>
> >
> >     INT64 pd_paris
> >     --------------------------------------------------------------------------------
> >     *** row group 1 of 1, values 1 to 2 ***
> >     value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
> >     value 2: R:0 D:0 V:<null>
> >
> >     INT64 pd_helsinki
> >     --------------------------------------------------------------------------------
> >     *** row group 1 of 1, values 1 to 2 ***
> >     value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
> >     value 2: R:0 D:0 V:<null>
> >
> >     INT64 pd_mixed
> >     --------------------------------------------------------------------------------
> >     *** row group 1 of 1, values 1 to 2 ***
> >     value 1: R:0 D:1 V:1970-01-01T01:00:00.000
> >     value 2: R:0 D:1 V:1970-01-01T02:00:00.000
> >
> > And naturally by pandas as well:
> >
> >     In [2]: df2 = pd.read_parquet('test.parquet')
> >        ...: df2
> >     Out[2]:
> >         datetime   pd_no_tz                  pd_paris               pd_helsinki            pd_mixed
> >     0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01 02:00:00+02:00 1970-01-01 01:00:00
> >     1        NaT        NaT                       NaT                       NaT 1970-01-01 02:00:00
> >
> > Finally, let's try an older version of parquet-tools as well:
> >
> >     > parquet-tools meta test.parquet
> >     datetime:    OPTIONAL INT64 R:0 D:1
> >     pd_no_tz:    OPTIONAL INT64 R:0 D:1
> >     pd_paris:    OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
> >     pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
> >     pd_mixed:    OPTIONAL INT64 R:0 D:1
> >
>
> Just to confirm that in pyarrow 0.13.0 (which tracks
> parquet-cpp-in-development 1.6.0, prior to introduction of
> LogicalType), TIMESTAMP_* is set for all 5 columns. This is the
> forward compatibility that we are planning to restore.
>
> > This confirms that the legacy types are only written for
> > UTC-normalized timestamps in 0.14.0.
> >
> > In summary, when saving two timestamps from different time zones that
> > refer to the same instant, I would have expected pyarrow to normalize
> > to UTC and thereby sacrifice the local representations instead of
> > saving using local semantics and thereby sacrificing the instants. I
> > don't know whether that is the intended behaviour, but in any case,
> > based on this short manual testing, the new timestamp types written by
> > pyarrow are interoperable with the Java library.
> >
> > Br,
> >
> > Zoltan
> >
> > On Wed, Jul 10, 2019 at 4:30 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > Correct
> > >
> > > On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
> > > >
> > > > Hi Wes,
> > > >
> > > > Do you mean that the new logical types have already been released in 0.14.0
> > > > and a 0.14.1 is needed ASAP to fix this regression?
> > > >
> > > > Thanks,
> > > >
> > > > Zoltan
> > > >
> > > > On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <we...@gmail.com> wrote:
> > > >
> > > > > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
> > > > > issues and others we need to make a new release within the next 7-10
> > > > > days. We can point you to nightly Python builds to make testing for
> > > > > you easier so you don't have to build the project yourself.
> > > > >
> > > > > - Wes
> > > > >
> > > > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > > > > wrote:
> > > > > >
> > > > > > Hi,
> > > > > >
> > > > > > Oh, and one more thing: Before releasing the next Arrow version
> > > > > > incorporating the new logical types, we should definitely test that their
> > > > > > behaviour matches that of parquet-mr. When is the next release planned to
> > > > > > come out?
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:
> > > > > >
> > > > > > > Hi Wes,
> > > > > > >
> > > > > > > Yes, I agree that we should do that, but then we have a problem of
> > > > > > > what to do in the other direction, i.e. when we use the new logical
> > > > > > > types API to read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should
> > > > > > > we set the UTC-normalized flag? Tim has started a discussion about
> > > > > > > that, suggesting the use of three states, which I have just answered.
> > > > > > >
> > > > > > > Br,
> > > > > > >
> > > > > > > Zoltan
> > > > > > >
> > > > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <we...@gmail.com>
> > > > > wrote:
> > > > > > >
> > > > > > >> Thanks for the comments.
> > > > > > >>
> > > > > > >> So in summary I think that we need to set the TIMESTAMP_* converted
> > > > > > >> types to maintain forward compatibility and stay consistent with what
> > > > > > >> we were doing in the C++ library prior to the introduction of the
> > > > > > >> LogicalType metadata.
> > > > > > >>
> > > > > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <zi@cloudera.com.invalid
> > > > > >
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > Hi Wes,
> > > > > > >> >
> > > > > > >> > Both of the semantics are deterministic in one aspect and
> > > > > > >> > nondeterministic in another. Timestamps of instant semantics will
> > > > > > >> > always refer to the same instant, but their user-facing
> > > > > > >> > representation (how they get displayed) depends on the user's
> > > > > > >> > time zone. Timestamps of local semantics always have the same
> > > > > > >> > user-facing representation, but the instant they refer to is
> > > > > > >> > undefined (or ambiguous, depending on your point of view).
> > > > > > >> >
> > > > > > >> > My understanding is that Spark uses instant semantics, i.e.,
> > > > > > >> > timestamps are stored normalized to UTC and are displayed
> > > > > > >> > adjusted to the user's local time zone.
> > > > > > >> >
> > > > > > >> > Br,
> > > > > > >> >
> > > > > > >> > Zoltan
> > > > > > >> >
> > > > > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <we...@gmail.com>
> > > > > > >> wrote:
> > > > > > >> >
> > > > > > >> > > Thanks Zoltan.
> > > > > > >> > >
> > > > > > >> > > This is definitely a tricky issue.
> > > > > > >> > >
> > > > > > >> > > Spark's application of localtime semantics to timestamp data
> > > > > > >> > > has been a source of issues for many people. Personally I don't
> > > > > > >> > > find that behavior to be particularly helpful since, depending
> > > > > > >> > > on the session time zone, you will get different results on
> > > > > > >> > > data not marked as UTC-normalized.
> > > > > > >> > >
> > > > > > >> > > In pandas, in contrast, we apply UTC semantics to
> > > > > > >> > > naive/not-explicitly-normalized data, so at least the code
> > > > > > >> > > produces deterministic results in all environments. That seems
> > > > > > >> > > strictly better to me -- if you want a localized interpretation
> > > > > > >> > > of naive data, you must opt into it rather than having it
> > > > > > >> > > automatically selected based on your locale. The instances of
> > > > > > >> > > people shooting their toes off due to time zones are
> > > > > > >> > > practically non-existent, whereas I'm hearing about Spark
> > > > > > >> > > gotchas all the time.
> > > > > > >> > >
> > > > > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi
> > > > > <zi@cloudera.com.invalid
> > > > > > >> >
> > > > > > >> > > wrote:
> > > > > > >> > > >
> > > > > > >> > > > Hi Wes,
> > > > > > >> > > >
> > > > > > >> > > > The rules for TIMESTAMP forward-compatibility were created
> > > > > > >> > > > based on the assumption that TIMESTAMP_MILLIS and
> > > > > > >> > > > TIMESTAMP_MICROS have only been used in the instant (a.k.a.
> > > > > > >> > > > UTC-normalized) semantics so far. This assumption was
> > > > > > >> > > > supported by two sources:
> > > > > > >> > > >
> > > > > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS
> > > > > > >> > > > and TIMESTAMP_MICROS as the number of milli/microseconds
> > > > > > >> > > > elapsed since the Unix epoch, an instant specified in UTC,
> > > > > > >> > > > from which it follows that they have instant semantics
> > > > > > >> > > > (because timestamps of local semantics do not correspond to a
> > > > > > >> > > > single instant).
> > > > > > >> > > >
> > > > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software
> > > > > > >> > > > component that used these types differently from the
> > > > > > >> > > > specification.
> > > > > > >> > > >
> > > > > > >> > > > Based on your e-mail, we were wrong on #2.
> > > > > > >> > > >
> > > > > > >> > > > From this false premise it followed that TIMESTAMPs with
> > > > > > >> > > > local semantics were a new type and did not need to be
> > > > > > >> > > > annotated with the old types to maintain compatibility. In
> > > > > > >> > > > fact, annotating them with the old types was considered to be
> > > > > > >> > > > harmful, since it would have misled older readers into
> > > > > > >> > > > thinking that they can read TIMESTAMPs with local semantics,
> > > > > > >> > > > when in reality they would have misinterpreted them as
> > > > > > >> > > > TIMESTAMPs with instant semantics. This would have led to a
> > > > > > >> > > > difference of several hours, corresponding to the time zone
> > > > > > >> > > > offset.
> > > > > > >> > > >
> > > > > > >> > > > In the light of your e-mail, this misinterpretation of
> > > > > > >> > > > timestamps may already be happening, since if Arrow annotates
> > > > > > >> > > > local timestamps with TIMESTAMP_MILLIS or TIMESTAMP_MICROS,
> > > > > > >> > > > Spark probably misinterprets them as timestamps with instant
> > > > > > >> > > > semantics, leading to a difference of several hours.
> > > > > > >> > > >
> > > > > > >> > > > Based on this, I think it would make sense from Arrow's point
> > > > > > >> > > > of view to annotate both semantics with the old types, since
> > > > > > >> > > > that is its historical behaviour and keeping it up is needed
> > > > > > >> > > > for maintaining compatibility. I'm not so sure about the Java
> > > > > > >> > > > library though, since as far as I know, these types were
> > > > > > >> > > > never used in the local sense there (although I may be wrong
> > > > > > >> > > > again). Were we to decide that Arrow and parquet-mr should
> > > > > > >> > > > behave differently in this aspect though, it may be tricky to
> > > > > > >> > > > convey this distinction in the specification. I would be
> > > > > > >> > > > interested in hearing your and other developers' opinions on
> > > > > > >> > > > this.
> > > > > > >> > > >
> > > > > > >> > > > Thanks,
> > > > > > >> > > >
> > > > > > >> > > > Zoltan
> > > > > > >> > > >

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Wes McKinney <we...@gmail.com>.
On Thu, Jul 11, 2019 at 8:17 AM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
>
> Hi Wes,
>
> I did a little bit of testing using pyarrow 0.14.0. I know that this
> is not the latest state of the code, but my understanding is that the
> only planned change is that 0.14.1 will add the legacy types even for
> local semantics. Here is what I did:
>
>     In [1]: import pandas as pd
>        ...: from datetime import datetime
>        ...: from pandas import Timestamp
>        ...: from pytz import timezone
>        ...: ts_dt = datetime(1970, 1, 1)
>        ...: ts_pd = Timestamp(year=1970, month=1, day=1)
>        ...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1,
> tz=timezone("Europe/Paris"))
>        ...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1,
> hour=2, tz=timezone("Europe/Helsinki"))
>        ...: df = pd.DataFrame({
>        ...:     'datetime': [ts_dt, None],
>        ...:     'pd_no_tz': [ts_pd, None],
>        ...:     'pd_paris': [ts_pd_paris, None],
>        ...:     'pd_helsinki': [ts_pd_helsinki, None],
>        ...:     'pd_mixed': [ts_pd_paris, ts_pd_helsinki]
>        ...: })
>        ...: df.to_parquet('test.parquet')
>        ...: df
>     Out[1]:
>         datetime   pd_no_tz                  pd_paris
> pd_helsinki                   pd_mixed
>     0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01
> 02:00:00+02:00  1970-01-01 01:00:00+01:00
>     1        NaT        NaT                       NaT
>      NaT  1970-01-01 02:00:00+02:00
>
> I picked these values because I expected all of these timestamps to be
> saved as the numeric value 0. Checking the metadata using
> parquet-tools:
>
>     > parquet-tools meta test.parquet
>     datetime:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
>     pd_no_tz:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
>     pd_paris:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
>     pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
>     pd_mixed:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
>
> This matched my expectations up until pd_mixed. I was surprised to see
> that timestamps with mixed time zones were be stored using local
> semantics instead of being normalized to UTC, but if it's an
> intentional choice, I'm fine with it as long as the numbers are
> correct:
>
>     > parquet-tools head test.parquet
>     datetime = 0
>     pd_no_tz = 0
>     pd_paris = 0
>     pd_helsinki = 0
>     pd_mixed = 3600000
>     pd_mixed = 7200000
>
> The numbers for the mixed column are not 0, but that is just the
> result of not using UTC-normalized semantics there. The values can be
> correctly interpreted by parquet-tools:
>
>     > parquet-tools dump test.parquet
>     INT64 datetime
>     --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T00:00:00.000
>     value 2: R:0 D:0 V:<null>
>
>     INT64 pd_no_tz
>     --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T00:00:00.000
>     value 2: R:0 D:0 V:<null>
>
>     INT64 pd_paris
>     --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
>     value 2: R:0 D:0 V:<null>
>
>     INT64 pd_helsinki
>     --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
>     value 2: R:0 D:0 V:<null>
>
>     INT64 pd_mixed
>     --------------------------------------------------------------------------------
>     *** row group 1 of 1, values 1 to 2 ***
>     value 1: R:0 D:1 V:1970-01-01T01:00:00.000
>     value 2: R:0 D:1 V:1970-01-01T02:00:00.000
>
> And naturally by pandas as well:
>
>     In [2]: df2 = pd.read_parquet('test.parquet')
>        ...: df2
>     Out[2]:
>         datetime   pd_no_tz                  pd_paris
> pd_helsinki            pd_mixed
>     0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01
> 02:00:00+02:00 1970-01-01 01:00:00
>     1        NaT        NaT                       NaT
>      NaT 1970-01-01 02:00:00
>
> Finally, let's try an older version of parquet-tools as well:
>
>     > parquet-tools meta test.parquet
>     datetime:    OPTIONAL INT64 R:0 D:1
>     pd_no_tz:    OPTIONAL INT64 R:0 D:1
>     pd_paris:    OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
>     pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
>     pd_mixed:    OPTIONAL INT64 R:0 D:1
>

Just to confirm that in pyarrow 0.13.0 (which tracks
parquet-cpp-in-development 1.6.0, prior to introduction of
LogicalType), TIMESTAMP_* is set for all 5 columns. This is the
forward compatibility that we are planning to restore.

> This confirms that the legacy types are only written for
> UTC-normalized timestamps in 0.14.0.
>
> In summary, when saving two timestamps from different time zones that
> refer to the same instant, I would have expected pyarrow to normalize
> to UTC and thereby sacrifice the local representations instead of
> saving using local semantics and thereby sacrificing the instants. I
> don't know whether that is the intended behaviour, but in any case,
> based on this short manual testing, the new timestamp types written by
> pyarrow are interopable with the Java library.
>
> Br,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 4:30 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > Correct
> >
> > On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
> > >
> > > Hi Wes,
> > >
> > > Do you mean that the new logical types have already been released in 0.14.0
> > > and a 0.14.1 is needed ASAP to fix this regression?
> > >
> > > Thanks,
> > >
> > > Zoltan
> > >
> > > On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <we...@gmail.com> wrote:
> > >
> > > > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
> > > > issues and others we need to make a new release within the next 7-10
> > > > days. We can point you to nightly Python builds to make testing for
> > > > you easier so you don't have to build the project yourself.
> > > >
> > > > - Wes
> > > >
> > > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > > > wrote:
> > > > >
> > > > > Hi,
> > > > >
> > > > > Oh, and one more thing: Before releasing the next Arrow version
> > > > > incorporating the new logical types, we should definitely test that their
> > > > > behaviour matches that of parquet-mr. When is the next release planned to
> > > > > come out?
> > > > >
> > > > > Br,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:
> > > > >
> > > > > > Hi Wes,
> > > > > >
> > > > > > Yes, I agree that we should do that, but then we have a problem of
> > > > what to
> > > > > > do in the other direction, i.e. when we use the new logical types API
> > > > to
> > > > > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> > > > > > normalized flag? Tim has started a discussion about that, suggesting
> > > > to use
> > > > > > three states that I just answered.
> > > > > >
> > > > > > Br,
> > > > > >
> > > > > > Zoltan
> > > > > >
> > > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <we...@gmail.com>
> > > > wrote:
> > > > > >
> > > > > >> Thank for the comments.
> > > > > >>
> > > > > >> So in summary I think that we need to set the TIMESTAMP_* converted
> > > > > >> types to maintain forward compatibility and stay consistent with what
> > > > > >> we were doing in the C++ library prior to the introduction of the
> > > > > >> LogicalType metadata.
> > > > > >>
> > > > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <zi@cloudera.com.invalid
> > > > >
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > Hi Wes,
> > > > > >> >
> > > > > >> > Both of the semantics are deterministic in one aspect and
> > > > > >> indeterministic
> > > > > >> > in another. Timestamps of instant semantic will always refer to the
> > > > same
> > > > > >> > instant, but their user-facing representation (how they get
> > > > displayed)
> > > > > >> > depends on the user's time zone. Timestamps of local semantics
> > > > always
> > > > > >> have
> > > > > >> > the same user-facing representation but the instant they refer to is
> > > > > >> > undefined (or ambigous, depending on your point of view).
> > > > > >> >
> > > > > >> > My understanding is that Spark uses instant semantics, i.e.,
> > > > timestamps
> > > > > >> are
> > > > > >> > stored normalized to UTC and are displayed adjusted to the user's
> > > > local
> > > > > >> > time zone.
> > > > > >> >
> > > > > >> > Br,
> > > > > >> >
> > > > > >> > Zoltan
> > > > > >> >
> > > > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <we...@gmail.com>
> > > > > >> wrote:
> > > > > >> >
> > > > > >> > > Thanks Zoltan.
> > > > > >> > >
> > > > > >> > > This is definitely a tricky issue.
> > > > > >> > >
> > > > > >> > > Spark's application of localtime semantics to timestamp data has
> > > > been
> > > > > >> > > a source of issues for many people. Personally I don't find that
> > > > > >> > > behavior to be particularly helpful since depending on the session
> > > > > >> > > time zone, you will get different results on data not marked as
> > > > > >> > > UTC-normalized.
> > > > > >> > >
> > > > > >> > > In pandas, as contrast, we apply UTC semantics to
> > > > > >> > > naive/not-explicitly-normalized data so at least the code produces
> > > > > >> > > deterministic results on all environments. That seems strictly
> > > > better
> > > > > >> > > to me -- if you want a localized interpretation of naive data, you
> > > > > >> > > must opt into this rather than having it automatically selected
> > > > based
> > > > > >> > > on your locale. The instances of people shooting their toes off
> > > > due to
> > > > > >> > > time zones are practically non-existent, whereas I'm hearing about
> > > > > >> > > Spark gotchas all the time.
> > > > > >> > >
> > > > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi
> > > > <zi@cloudera.com.invalid
> > > > > >> >
> > > > > >> > > wrote:
> > > > > >> > > >
> > > > > >> > > > Hi Wes,
> > > > > >> > > >
> > > > > >> > > > The rules for TIMESTAMP forward-compatibility were created
> > > > based on
> > > > > >> the
> > > > > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
> > > > > >> been used
> > > > > >> > > > in the instant aka. UTC-normalized semantics so far. This
> > > > > >> assumption was
> > > > > >> > > > supported by two sources:
> > > > > >> > > >
> > > > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS
> > > > and
> > > > > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed
> > > > since
> > > > > >> the
> > > > > >> > > Unix
> > > > > >> > > > epoch, an instant specified in UTC, from which it follows that
> > > > they
> > > > > >> have
> > > > > >> > > > instant semantics (because timestamps of local semantics do not
> > > > > >> > > correspond
> > > > > >> > > > to a single instant).
> > > > > >> > > >
> > > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software
> > > > component
> > > > > >> that
> > > > > >> > > > used these types differently from the specification.
> > > > > >> > > >
> > > > > >> > > > Based on your e-mail, we were wrong on #2.
> > > > > >> > > >
> > > > > >> > > > From this false premise it followed that TIMESTAMPs with local
> > > > > >> semantics
> > > > > >> > > > were a new type and did not need to be annotated with the old
> > > > types
> > > > > >> to
> > > > > >> > > > maintain compatibility. In fact, annotating them with the old
> > > > types
> > > > > >> was
> > > > > >> > > > considered to be harmful, since it would have misled older
> > > > readers
> > > > > >> into
> > > > > >> > > > thinking that they can read TIMESTAMPs with local semantics,
> > > > when in
> > > > > >> > > > reality they would have misinterpreted them as TIMESTAMPs with
> > > > > >> instant
> > > > > >> > > > semantics. This would have led to a difference of several
> > > > hours,
> > > > > >> > > > corresponding to the time zone offset.
> > > > > >> > > >
> > > > > >> > > > In the light of your e-mail, this misinterpretation of
> > > > timestamps
> > > > > >> may
> > > > > >> > > > already be happening, since if Arrow annotates local timestamps
> > > > with
> > > > > >> > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably
> > > > misinterprets
> > > > > >> them
> > > > > >> > > as
> > > > > >> > > > timestamps with instant semantics, leading to a difference of
> > > > > >> several
> > > > > >> > > hours.
> > > > > >> > > >
> > > > > >> > > > Based on this, I think it would make sense from Arrow's point of
> > > > > >> view to
> > > > > >> > > > annotate both semantics with the old types, since that is its
> > > > > >> historical
> > > > > >> > > > behaviour and keeping it up is needed for maintaining
> > > > compatibility.
> > > > > >> I'm
> > > > > >> > > not
> > > > > >> > > > so sure about the Java library though, since as far as I know,
> > > > these
> > > > > >> > > types
> > > > > >> > > > were never used in the local sense there (although I may be
> > > > wrong
> > > > > >> again).
> > > > > >> > > > Were we to decide that Arrow and parquet-mr should behave
> > > > > >> differently in
> > > > > >> > > > this aspect though, it may be tricky to convey this distinction
> > > > in
> > > > > >> the
> > > > > >> > > > specification. I would be interested in hearing your and other
> > > > > >> > > developers'
> > > > > >> > > > opinions on this.
> > > > > >> > > >
> > > > > >> > > > Thanks,
> > > > > >> > > >
> > > > > >> > > > Zoltan
> > > > > >> > > >
> > > > > >> > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <
> > > > wesmckinn@gmail.com>
> > > > > >> wrote:
> > > > > >> > > >
> > > > > >> > > > > hi folks,
> > > > > >> > > > >
> > > > > >> > > > > We have just recently implemented the new LogicalType unions
> > > > in
> > > > > >> the
> > > > > >> > > > > Parquet C++ library and we have run into a forward
> > > > compatibility
> > > > > >> > > > > problem with reader versions prior to this implementation.
> > > > > >> > > > >
> > > > > >> > > > > To recap the issue, prior to the introduction of LogicalType,
> > > > the
> > > > > >> > > > > Parquet format had no explicit notion of time zones or UTC
> > > > > >> > > > > normalization. The new TimestampType provides a flag to
> > > > indicate
> > > > > >> > > > > UTC-normalization
> > > > > >> > > > >
> > > > > >> > > > > struct TimestampType {
> > > > > >> > > > > 1: required bool isAdjustedToUTC
> > > > > >> > > > > 2: required TimeUnit unit
> > > > > >> > > > > }
> > > > > >> > > > >
> > > > > >> > > > > When using this new type, the ConvertedType field must also be
> > > > > >> set for
> > > > > >> > > > > forward compatibility (so that old readers can still
> > > > understand
> > > > > >> the
> > > > > >> > > > > data), but parquet.thrift says
> > > > > >> > > > >
> > > > > >> > > > > // use ConvertedType TIMESTAMP_MICROS for
> > > > > >> TIMESTAMP(isAdjustedToUTC =
> > > > > >> > > > > true, unit = MICROS)
> > > > > >> > > > > // use ConvertedType TIMESTAMP_MILLIS for
> > > > > >> TIMESTAMP(isAdjustedToUTC =
> > > > > >> > > > > true, unit = MILLIS)
> > > > > >> > > > > 8: TimestampType TIMESTAMP
> > > > > >> > > > >
> > > > > >> > > > > In Apache Arrow, we have 2 varieties of timestamps:
> > > > > >> > > > >
> > > > > >> > > > > * Timestamp without time zone (no UTC normalization indicated)
> > > > > >> > > > > * Timestamp with time zone (values UTC-normalized)
> > > > > >> > > > >
> > > > > >> > > > > Prior to the introduction of LogicalType, we would set either
> > > > > >> > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC
> > > > > >> > > > > normalization. So when reading the data back, any notion of
> > > > > >> having had
> > > > > >> > > > > a time zone is lost (it could be stored in schema metadata if
> > > > > >> > > > > desired).
> > > > > >> > > > >
> > > > > >> > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_
> > > > when
> > > > > >> > > > > isAdjustedToUTC is true creates a forward compatibility break
> > > > in
> > > > > >> this
> > > > > >> > > > > regard. This was reported to us shortly after releasing Apache
> > > > > >> Arrow
> > > > > >> > > > > 0.14.0:
> > > > > >> > > > >
> > > > > >> > > > > https://issues.apache.org/jira/browse/ARROW-5878
> > > > > >> > > > >
> > > > > >> > > > > We are discussing setting the ConvertedType unconditionally in
> > > > > >> > > > >
> > > > > >> > > > > https://github.com/apache/arrow/pull/4825
> > > > > >> > > > >
> > > > > >> > > > > This might need to be a setting that is toggled when data is
> > > > > >> coming
> > > > > >> > > > > from Arrow, but I wonder if the text in parquet.thrift is the
> > > > > >> intended
> > > > > >> > > > > forward compatibility interpretation, and if not should we
> > > > amend.
> > > > > >> > > > >
> > > > > >> > > > > Thanks,
> > > > > >> > > > > Wes
> > > > > >> > > > >
> > > > > >> > >
> > > > > >>
> > > > > >
> > > >

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Joris Van den Bossche <jo...@gmail.com>.
Ah, that is not that unexpected to me. Pandas stores the data as object
dtype (to preserve the mixed types, it is up to the user to force a certain
conversion), and so for object dtype data you need to do inference to
determine which arrow or parquet type to use (in contrast to the other,
non-mixed columns, which in pandas use a designated datetime64 dtype).
And inference can always be somewhat subjective or incomplete, and here
pyarrow and fastparquet seem to make a different decision regarding this
inference (where pyarrow is arguably making a wrong decision).
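
A minimal illustration of that fallback (a sketch; the values are mine, chosen so both rows denote the Unix epoch):

```python
import pandas as pd
from pandas import Timestamp

# two tz-aware timestamps with different UTC offsets
s = pd.Series([Timestamp("1970-01-01 01:00:00+01:00"),
               Timestamp("1970-01-01 02:00:00+02:00")])
# mixed offsets cannot be unified into a single datetime64 dtype,
# so pandas falls back to object dtype and writers must infer a type
print(s.dtype)  # object
```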

Joris

Op do 11 jul. 2019 om 10:04 schreef Zoltan Ivanfi <zi...@cloudera.com.invalid>:

> Hi Joris,
>
> Out of curiosity I tried it with fastparquet as well and that couldn't
> even save that column:
>
> ValueError: Can't infer object conversion type: 0    1970-01-01
> 01:00:00+01:00
> 1    1970-01-01 02:00:00+02:00
> Name: pd_mixed, dtype: object
>
> Br,
>
> Zoltan
>
> On Thu, Jul 11, 2019 at 3:55 PM Joris Van den Bossche
> <jo...@gmail.com> wrote:
> >
> > Created an issue for the mixed timezones here:
> > https://issues.apache.org/jira/browse/ARROW-5912
> >
> > Op do 11 jul. 2019 om 09:42 schreef Joris Van den Bossche <
> > jorisvandenbossche@gmail.com>:
> >
> > > Clarification regarding the mixed types (this is in the end not really
> > > related to parquet, but to how pandas gets converted to pyarrow)
> > >
> > > Op do 11 jul. 2019 om 09:17 schreef Zoltan Ivanfi
> <zi@cloudera.com.invalid
> > > >:
> > >
> > >> ...
> > >> This matched my expectations up until pd_mixed. I was surprised to see
> > >> that timestamps with mixed time zones were stored using local
> > >> semantics instead of being normalized to UTC,
> > >>
> > >
> > > For the actual parquet writing semantics, it is more relevant to look
> at
> > > the arrow Table that gets created from this DataFrame:
> > >
> > > In [20]: pa.Table.from_pandas(df)
> > > Out[20]:
> > > pyarrow.Table
> > > datetime: timestamp[ns]
> > > pd_no_tz: timestamp[ns]
> > > pd_paris: timestamp[ns, tz=Europe/Paris]
> > > pd_helsinki: timestamp[ns, tz=Europe/Helsinki]
> > > pd_mixed: timestamp[us]
> > >
> > > For all columns except for pd_mixed the result is clear and expected,
> but
> > > apparently pyarrow converts the mixed timestamps to a TimestampArray
> > > without timezone using the "local times", and not the UTC normalized
> times.
> > >
> > > Now, that certainly feels a bit buggy to me (or at least unexpected).
> But,
> > > this is an issue for the python -> arrow conversion, not related to the
> > > actual parquet writing. I will open a separate JIRA for this.
> > >
> > > Joris
> > >
>

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Joris,

Out of curiosity I tried it with fastparquet as well and that couldn't
even save that column:

ValueError: Can't infer object conversion type: 0    1970-01-01 01:00:00+01:00
1    1970-01-01 02:00:00+02:00
Name: pd_mixed, dtype: object

Br,

Zoltan

On Thu, Jul 11, 2019 at 3:55 PM Joris Van den Bossche
<jo...@gmail.com> wrote:
>
> Created an issue for the mixed timezones here:
> https://issues.apache.org/jira/browse/ARROW-5912
>
> Op do 11 jul. 2019 om 09:42 schreef Joris Van den Bossche <
> jorisvandenbossche@gmail.com>:
>
> > Clarification regarding the mixed types (this is in the end not really
> > related to parquet, but to how pandas gets converted to pyarrow)
> >
> > Op do 11 jul. 2019 om 09:17 schreef Zoltan Ivanfi <zi@cloudera.com.invalid
> > >:
> >
> >> ...
> >> This matched my expectations up until pd_mixed. I was surprised to see
> >> that timestamps with mixed time zones were stored using local
> >> semantics instead of being normalized to UTC,
> >>
> >
> > For the actual parquet writing semantics, it is more relevant to look at
> > the arrow Table that gets created from this DataFrame:
> >
> > In [20]: pa.Table.from_pandas(df)
> > Out[20]:
> > pyarrow.Table
> > datetime: timestamp[ns]
> > pd_no_tz: timestamp[ns]
> > pd_paris: timestamp[ns, tz=Europe/Paris]
> > pd_helsinki: timestamp[ns, tz=Europe/Helsinki]
> > pd_mixed: timestamp[us]
> >
> > For all columns except for pd_mixed the result is clear and expected, but
> > apparently pyarrow converts the mixed timestamps to a TimestampArray
> > without timezone using the "local times", and not the UTC normalized times.
> >
> > Now, that certainly feels a bit buggy to me (or at least unexpected). But,
> > this is an issue for the python -> arrow conversion, not related to the
> > actual parquet writing. I will open a separate JIRA for this.
> >
> > Joris
> >

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Joris Van den Bossche <jo...@gmail.com>.
Created an issue for the mixed timezones here:
https://issues.apache.org/jira/browse/ARROW-5912

Op do 11 jul. 2019 om 09:42 schreef Joris Van den Bossche <
jorisvandenbossche@gmail.com>:

> Clarification regarding the mixed types (this is in the end not really
> related to parquet, but to how pandas gets converted to pyarrow)
>
> Op do 11 jul. 2019 om 09:17 schreef Zoltan Ivanfi <zi@cloudera.com.invalid
> >:
>
>> ...
>> This matched my expectations up until pd_mixed. I was surprised to see
> >> that timestamps with mixed time zones were stored using local
>> semantics instead of being normalized to UTC,
>>
>
> For the actual parquet writing semantics, it is more relevant to look at
> the arrow Table that gets created from this DataFrame:
>
> In [20]: pa.Table.from_pandas(df)
> Out[20]:
> pyarrow.Table
> datetime: timestamp[ns]
> pd_no_tz: timestamp[ns]
> pd_paris: timestamp[ns, tz=Europe/Paris]
> pd_helsinki: timestamp[ns, tz=Europe/Helsinki]
> pd_mixed: timestamp[us]
>
> For all columns except for pd_mixed the result is clear and expected, but
> apparently pyarrow converts the mixed timestamps to a TimestampArray
> without timezone using the "local times", and not the UTC normalized times.
>
> Now, that certainly feels a bit buggy to me (or at least unexpected). But,
> this is an issue for the python -> arrow conversion, not related to the
> actual parquet writing. I will open a separate JIRA for this.
>
> Joris
>

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Joris Van den Bossche <jo...@gmail.com>.
Clarification regarding the mixed types (this is in the end not really
related to parquet, but to how pandas gets converted to pyarrow)

Op do 11 jul. 2019 om 09:17 schreef Zoltan Ivanfi <zi...@cloudera.com.invalid>:

> ...
> This matched my expectations up until pd_mixed. I was surprised to see
> that timestamps with mixed time zones were stored using local
> semantics instead of being normalized to UTC,
>

For the actual parquet writing semantics, it is more relevant to look at
the arrow Table that gets created from this DataFrame:

In [20]: pa.Table.from_pandas(df)
Out[20]:
pyarrow.Table
datetime: timestamp[ns]
pd_no_tz: timestamp[ns]
pd_paris: timestamp[ns, tz=Europe/Paris]
pd_helsinki: timestamp[ns, tz=Europe/Helsinki]
pd_mixed: timestamp[us]

For all columns except for pd_mixed the result is clear and expected, but
apparently pyarrow converts the mixed timestamps to a TimestampArray
without timezone using the "local times", and not the UTC normalized times.

Now, that certainly feels a bit buggy to me (or at least unexpected). But,
this is an issue for the python -> arrow conversion, not related to the
actual parquet writing. I will open a separate JIRA for this.
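
Until that is resolved, one way to side-step the inference (a sketch, assuming the caller is fine with losing the original offsets) is to normalize the mixed column to UTC before handing it to pyarrow:

```python
import pandas as pd
from pandas import Timestamp

# an object-dtype column with mixed UTC offsets (both rows are the epoch)
mixed = pd.Series([Timestamp("1970-01-01 01:00:00+01:00"),
                   Timestamp("1970-01-01 02:00:00+02:00")])
# utc=True converts every element to the instant it denotes,
# yielding a proper tz-aware datetime64 column
utc = pd.to_datetime(mixed, utc=True)
print(utc.nunique())  # 1 -- both rows are the Unix epoch
```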

Joris

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Wes,

I did a little bit of testing using pyarrow 0.14.0. I know that this
is not the latest state of the code, but my understanding is that the
only planned change is that 0.14.1 will add the legacy types even for
local semantics. Here is what I did:

    In [1]: import pandas as pd
       ...: from datetime import datetime
       ...: from pandas import Timestamp
       ...: from pytz import timezone
       ...: ts_dt = datetime(1970, 1, 1)
       ...: ts_pd = Timestamp(year=1970, month=1, day=1)
       ...: ts_pd_paris = Timestamp(year=1970, month=1, day=1, hour=1,
tz=timezone("Europe/Paris"))
       ...: ts_pd_helsinki = Timestamp(year=1970, month=1, day=1,
hour=2, tz=timezone("Europe/Helsinki"))
       ...: df = pd.DataFrame({
       ...:     'datetime': [ts_dt, None],
       ...:     'pd_no_tz': [ts_pd, None],
       ...:     'pd_paris': [ts_pd_paris, None],
       ...:     'pd_helsinki': [ts_pd_helsinki, None],
       ...:     'pd_mixed': [ts_pd_paris, ts_pd_helsinki]
       ...: })
       ...: df.to_parquet('test.parquet')
       ...: df
    Out[1]:
        datetime   pd_no_tz                  pd_paris
pd_helsinki                   pd_mixed
    0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01
02:00:00+02:00  1970-01-01 01:00:00+01:00
    1        NaT        NaT                       NaT
     NaT  1970-01-01 02:00:00+02:00

I picked these values because I expected all of these timestamps to be
saved as the numeric value 0. Checking the metadata using
parquet-tools:

    > parquet-tools meta test.parquet
    datetime:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
    pd_no_tz:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1
    pd_paris:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
    pd_helsinki: OPTIONAL INT64 L:TIMESTAMP(MILLIS,true) R:0 D:1
    pd_mixed:    OPTIONAL INT64 L:TIMESTAMP(MILLIS,false) R:0 D:1

This matched my expectations up until pd_mixed. I was surprised to see
that timestamps with mixed time zones were stored using local
semantics instead of being normalized to UTC, but if it's an
intentional choice, I'm fine with it as long as the numbers are
correct:

    > parquet-tools head test.parquet
    datetime = 0
    pd_no_tz = 0
    pd_paris = 0
    pd_helsinki = 0
    pd_mixed = 3600000
    pd_mixed = 7200000

The numbers for the mixed column are not 0, but that is just the
result of not using UTC-normalized semantics there. The values can be
correctly interpreted by parquet-tools:

    > parquet-tools dump test.parquet
    INT64 datetime
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 2 ***
    value 1: R:0 D:1 V:1970-01-01T00:00:00.000
    value 2: R:0 D:0 V:<null>

    INT64 pd_no_tz
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 2 ***
    value 1: R:0 D:1 V:1970-01-01T00:00:00.000
    value 2: R:0 D:0 V:<null>

    INT64 pd_paris
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 2 ***
    value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
    value 2: R:0 D:0 V:<null>

    INT64 pd_helsinki
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 2 ***
    value 1: R:0 D:1 V:1970-01-01T00:00:00.000+0000
    value 2: R:0 D:0 V:<null>

    INT64 pd_mixed
    --------------------------------------------------------------------------------
    *** row group 1 of 1, values 1 to 2 ***
    value 1: R:0 D:1 V:1970-01-01T01:00:00.000
    value 2: R:0 D:1 V:1970-01-01T02:00:00.000
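
The local-semantics encoding can be reproduced with stdlib arithmetic alone (a sketch; variable names are mine):

```python
from datetime import datetime, timezone

# Local semantics keep the wall-clock fields: 1970-01-01 01:00:00+01:00
# is stored as if it read 01:00:00 in UTC, i.e. 3,600,000 ms after epoch
wall = datetime(1970, 1, 1, 1, 0, 0, tzinfo=timezone.utc)
millis = int(wall.timestamp() * 1000)
print(millis)  # 3600000
```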

And naturally by pandas as well:

    In [2]: df2 = pd.read_parquet('test.parquet')
       ...: df2
    Out[2]:
        datetime   pd_no_tz                  pd_paris
pd_helsinki            pd_mixed
    0 1970-01-01 1970-01-01 1970-01-01 01:00:00+01:00 1970-01-01
02:00:00+02:00 1970-01-01 01:00:00
    1        NaT        NaT                       NaT
     NaT 1970-01-01 02:00:00

Finally, let's try an older version of parquet-tools as well:

    > parquet-tools meta test.parquet
    datetime:    OPTIONAL INT64 R:0 D:1
    pd_no_tz:    OPTIONAL INT64 R:0 D:1
    pd_paris:    OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
    pd_helsinki: OPTIONAL INT64 O:TIMESTAMP_MILLIS R:0 D:1
    pd_mixed:    OPTIONAL INT64 R:0 D:1

This confirms that the legacy types are only written for
UTC-normalized timestamps in 0.14.0.

In summary, when saving two timestamps from different time zones that
refer to the same instant, I would have expected pyarrow to normalize
to UTC and thereby sacrifice the local representations instead of
saving using local semantics and thereby sacrificing the instants. I
don't know whether that is the intended behaviour, but in any case,
based on this short manual testing, the new timestamp types written by
pyarrow are interoperable with the Java library.

Br,

Zoltan

On Wed, Jul 10, 2019 at 4:30 PM Wes McKinney <we...@gmail.com> wrote:
>
> Correct
>
> On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
> >
> > Hi Wes,
> >
> > Do you mean that the new logical types have already been released in 0.14.0
> > and a 0.14.1 is needed ASAP to fix this regression?
> >
> > Thanks,
> >
> > Zoltan
> >
> > On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <we...@gmail.com> wrote:
> >
> > > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
> > > issues and others we need to make a new release within the next 7-10
> > > days. We can point you to nightly Python builds to make testing for
> > > you easier so you don't have to build the project yourself.
> > >
> > > - Wes
> > >
> > > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > > wrote:
> > > >
> > > > Hi,
> > > >
> > > > Oh, and one more thing: Before releasing the next Arrow version
> > > > incorporating the new logical types, we should definitely test that their
> > > > behaviour matches that of parquet-mr. When is the next release planned to
> > > > come out?
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:
> > > >
> > > > > Hi Wes,
> > > > >
> > > > > Yes, I agree that we should do that, but then we have a problem of
> > > what to
> > > > > do in the other direction, i.e. when we use the new logical types API
> > > to
> > > > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> > > > > normalized flag? Tim has started a discussion about that, suggesting
> > > to use
> > > > > three states, which I have just answered.
> > > > >
> > > > > Br,
> > > > >
> > > > > Zoltan
> > > > >
> > > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <we...@gmail.com>
> > > wrote:
> > > > >
> > > > >> Thanks for the comments.
> > > > >>
> > > > >> So in summary I think that we need to set the TIMESTAMP_* converted
> > > > >> types to maintain forward compatibility and stay consistent with what
> > > > >> we were doing in the C++ library prior to the introduction of the
> > > > >> LogicalType metadata.
> > > > >>
> > > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <zi@cloudera.com.invalid
> > > >
> > > > >> wrote:
> > > > >> >
> > > > >> > Hi Wes,
> > > > >> >
> > > > >> > Both of the semantics are deterministic in one aspect and
> > > > >> indeterministic
> > > > >> > in another. Timestamps of instant semantics will always refer to the
> > > same
> > > > >> > instant, but their user-facing representation (how they get
> > > displayed)
> > > > >> > depends on the user's time zone. Timestamps of local semantics
> > > always
> > > > >> have
> > > > >> > the same user-facing representation but the instant they refer to is
> > > > >> > undefined (or ambiguous, depending on your point of view).
> > > > >> >
> > > > >> > My understanding is that Spark uses instant semantics, i.e.,
> > > timestamps
> > > > >> are
> > > > >> > stored normalized to UTC and are displayed adjusted to the user's
> > > local
> > > > >> > time zone.
> > > > >> >
> > > > >> > Br,
> > > > >> >
> > > > >> > Zoltan
> > > > >> >
> > > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <we...@gmail.com>
> > > > >> wrote:
> > > > >> >
> > > > >> > > Thanks Zoltan.
> > > > >> > >
> > > > >> > > This is definitely a tricky issue.
> > > > >> > >
> > > > >> > > Spark's application of localtime semantics to timestamp data has
> > > been
> > > > >> > > a source of issues for many people. Personally I don't find that
> > > > >> > > behavior to be particularly helpful since depending on the session
> > > > >> > > time zone, you will get different results on data not marked as
> > > > >> > > UTC-normalized.
> > > > >> > >
> > > >> > > In pandas, by contrast, we apply UTC semantics to
> > > > >> > > naive/not-explicitly-normalized data so at least the code produces
> > > > >> > > deterministic results on all environments. That seems strictly
> > > better
> > > > >> > > to me -- if you want a localized interpretation of naive data, you
> > > > >> > > must opt into this rather than having it automatically selected
> > > based
> > > > >> > > on your locale. The instances of people shooting their toes off
> > > due to
> > > > >> > > time zones are practically non-existent, whereas I'm hearing about
> > > > >> > > Spark gotchas all the time.
> > > > >> > >
> > > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi
> > > <zi@cloudera.com.invalid
> > > > >> >
> > > > >> > > wrote:
> > > > >> > > >
> > > > >> > > > Hi Wes,
> > > > >> > > >
> > > > >> > > > The rules for TIMESTAMP forward-compatibility were created
> > > based on
> > > > >> the
> > > > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
> > > > >> been used
> > > > >> > > > in the instant aka. UTC-normalized semantics so far. This
> > > > >> assumption was
> > > > >> > > > supported by two sources:
> > > > >> > > >
> > > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS
> > > and
> > > > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed
> > > since
> > > > >> the
> > > > >> > > Unix
> > > > >> > > > epoch, an instant specified in UTC, from which it follows that
> > > they
> > > > >> have
> > > > >> > > > instant semantics (because timestamps of local semantics do not
> > > > >> > > correspond
> > > > >> > > > to a single instant).
> > > > >> > > >
> > > > >> > > > 2. Anecdotal knowledge: We were not aware of any software
> > > component
> > > > >> that
> > > > >> > > > used these types differently from the specification.
> > > > >> > > >
> > > > >> > > > Based on your e-mail, we were wrong on #2.
> > > > >> > > >
> > > > >> > > > From this false premise it followed that TIMESTAMPs with local
> > > > >> semantics
> > > > >> > > > were a new type and did not need to be annotated with the old
> > > types
> > > > >> to
> > > > >> > > > maintain compatibility. In fact, annotating them with the old
> > > types
> > > > >> was
> > > > >> > > > considered to be harmful, since it would have misled older
> > > readers
> > > > >> into
> > > > >> > > > thinking that they can read TIMESTAMPs with local semantics,
> > > when in
> > > > >> > > > reality they would have misinterpreted them as TIMESTAMPs with
> > > > >> instant
> > > > >> > > > semantics. This would have led to a difference of several
> > > hours,
> > > > >> > > > corresponding to the time zone offset.
> > > > >> > > >
> > > > >> > > > In the light of your e-mail, this misinterpretation of
> > > timestamps
> > > > >> may
> > > > >> > > > already be happening, since if Arrow annotates local timestamps
> > > with
> > > > >> > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably
> > > misinterprets
> > > > >> them
> > > > >> > > as
> > > > >> > > > timestamps with instant semantics, leading to a difference of
> > > > >> several
> > > > >> > > hours.
> > > > >> > > >
> > > > >> > > > Based on this, I think it would make sense from Arrow's point of
> > > > >> view to
> > > > >> > > > annotate both semantics with the old types, since that is its
> > > > >> historical
> > > > >> > > > behaviour and keeping it up is needed for maintaining
> > > compatibility.
> > > > >> I'm
> > > > >> > > not
> > > > >> > > > so sure about the Java library though, since as far as I know,
> > > these
> > > > >> > > types
> > > > >> > > > were never used in the local sense there (although I may be
> > > wrong
> > > > >> again).
> > > > >> > > > Were we to decide that Arrow and parquet-mr should behave
> > > > >> differently in
> > > > >> > > > this aspect though, it may be tricky to convey this distinction
> > > in
> > > > >> the
> > > > >> > > > specification. I would be interested in hearing your and other
> > > > >> > > developers'
> > > > >> > > > opinions on this.
> > > > >> > > >
> > > > >> > > > Thanks,
> > > > >> > > >
> > > > >> > > > Zoltan
> > > > >> > > >
> > > > >> > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <
> > > wesmckinn@gmail.com>
> > > > >> wrote:
> > > > >> > > >
> > > > >> > > > > hi folks,
> > > > >> > > > >
> > > > >> > > > > We have just recently implemented the new LogicalType unions
> > > in
> > > > >> the
> > > > >> > > > > Parquet C++ library and we have run into a forward
> > > compatibility
> > > > >> > > > > problem with reader versions prior to this implementation.
> > > > >> > > > >
> > > > >> > > > > To recap the issue, prior to the introduction of LogicalType,
> > > the
> > > > >> > > > > Parquet format had no explicit notion of time zones or UTC
> > > > >> > > > > normalization. The new TimestampType provides a flag to
> > > indicate
> > > > >> > > > > UTC-normalization
> > > > >> > > > >
> > > > >> > > > > struct TimestampType {
> > > > >> > > > > 1: required bool isAdjustedToUTC
> > > > >> > > > > 2: required TimeUnit unit
> > > > >> > > > > }
> > > > >> > > > >
> > > > >> > > > > When using this new type, the ConvertedType field must also be
> > > > >> set for
> > > > >> > > > > forward compatibility (so that old readers can still
> > > understand
> > > > >> the
> > > > >> > > > > data), but parquet.thrift says
> > > > >> > > > >
> > > > >> > > > > // use ConvertedType TIMESTAMP_MICROS for
> > > > >> TIMESTAMP(isAdjustedToUTC =
> > > > >> > > > > true, unit = MICROS)
> > > > >> > > > > // use ConvertedType TIMESTAMP_MILLIS for
> > > > >> TIMESTAMP(isAdjustedToUTC =
> > > > >> > > > > true, unit = MILLIS)
> > > > >> > > > > 8: TimestampType TIMESTAMP
> > > > >> > > > >
> > > > >> > > > > In Apache Arrow, we have 2 varieties of timestamps:
> > > > >> > > > >
> > > > >> > > > > * Timestamp without time zone (no UTC normalization indicated)
> > > > >> > > > > * Timestamp with time zone (values UTC-normalized)
> > > > >> > > > >
> > > > >> > > > > Prior to the introduction of LogicalType, we would set either
> > > > >> > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC
> > > > >> > > > > normalization. So when reading the data back, any notion of
> > > > >> having had
> > > > >> > > > > a time zone is lost (it could be stored in schema metadata if
> > > > >> > > > > desired).
> > > > >> > > > >
> > > > >> > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_
> > > when
> > > > >> > > > > isAdjustedToUTC is true creates a forward compatibility break
> > > in
> > > > >> this
> > > > >> > > > > regard. This was reported to us shortly after releasing Apache
> > > > >> Arrow
> > > > >> > > > > 0.14.0:
> > > > >> > > > >
> > > > >> > > > > https://issues.apache.org/jira/browse/ARROW-5878
> > > > >> > > > >
> > > > >> > > > > We are discussing setting the ConvertedType unconditionally in
> > > > >> > > > >
> > > > >> > > > > https://github.com/apache/arrow/pull/4825
> > > > >> > > > >
> > > > >> > > > > This might need to be a setting that is toggled when data is
> > > > >> coming
> > > > >> > > > > from Arrow, but I wonder if the text in parquet.thrift is the
> > > > >> intended
> > > > >> > > > > forward compatibility interpretation, and if not should we
> > > amend.
> > > > >> > > > >
> > > > >> > > > > Thanks,
> > > > >> > > > > Wes
> > > > >> > > > >
> > > > >> > >
> > > > >>
> > > > >
> > >

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Wes McKinney <we...@gmail.com>.
Correct

On Wed, Jul 10, 2019 at 9:21 AM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:
>
> Hi Wes,
>
> Do you mean that the new logical types have already been released in 0.14.0
> and a 0.14.1 is needed ASAP to fix this regression?
>
> Thanks,
>
> Zoltan
>
> On Wed, Jul 10, 2019 at 4:13 PM Wes McKinney <we...@gmail.com> wrote:
>
> > hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
> > issues and others we need to make a new release within the next 7-10
> > days. We can point you to nightly Python builds to make testing for
> > you easier so you don't have to build the project yourself.
> >
> > - Wes
> >
> > On Wed, Jul 10, 2019 at 9:11 AM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> > wrote:
> > >
> > > Hi,
> > >
> > > Oh, and one more thing: Before releasing the next Arrow version
> > > incorporating the new logical types, we should definitely test that their
> > > behaviour matches that of parquet-mr. When is the next release planned to
> > > come out?
> > >
> > > Br,
> > >
> > > Zoltan
> > >
> > > On Wed, Jul 10, 2019 at 3:57 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:
> > >
> > > > Hi Wes,
> > > >
> > > > Yes, I agree that we should do that, but then we have a problem of
> > what to
> > > > do in the other direction, i.e. when we use the new logical types API
> > to
> > > > read a TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC
> > > > normalized flag? Tim has started a discussion about that, suggesting
> > to use
> > > > three states, which I have just answered.
> > > >
> > > > Br,
> > > >
> > > > Zoltan
> > > >
> > > > On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <we...@gmail.com>
> > wrote:
> > > >
> > > >> Thanks for the comments.
> > > >>
> > > >> So in summary I think that we need to set the TIMESTAMP_* converted
> > > >> types to maintain forward compatibility and stay consistent with what
> > > >> we were doing in the C++ library prior to the introduction of the
> > > >> LogicalType metadata.
> > > >>
> > > >> On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <zi@cloudera.com.invalid
> > >
> > > >> wrote:
> > > >> >
> > > >> > Hi Wes,
> > > >> >
> > > >> > Both of the semantics are deterministic in one aspect and
> > > >> indeterministic
> > > >> > in another. Timestamps of instant semantics will always refer to the
> > same
> > > >> > instant, but their user-facing representation (how they get
> > displayed)
> > > >> > depends on the user's time zone. Timestamps of local semantics
> > always
> > > >> have
> > > >> > the same user-facing representation but the instant they refer to is
> > > >> > undefined (or ambiguous, depending on your point of view).
> > > >> >
> > > >> > My understanding is that Spark uses instant semantics, i.e.,
> > timestamps
> > > >> are
> > > >> > stored normalized to UTC and are displayed adjusted to the user's
> > local
> > > >> > time zone.
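[Editor's note: the distinction Zoltan draws can be sketched with stdlib Python (`datetime`/`zoneinfo`, not Parquet APIs) — an illustrative aside, not part of the original mail:]

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# Instant semantics: the stored value pins one instant (UTC-normalized);
# only its user-facing display depends on the viewer's time zone.
instant = datetime(2019, 7, 9, 15, 0, tzinfo=timezone.utc)
in_la = instant.astimezone(ZoneInfo("America/Los_Angeles"))
in_budapest = instant.astimezone(ZoneInfo("Europe/Budapest"))
assert in_la == in_budapest            # same instant...
assert in_la.hour != in_budapest.hour  # ...different wall-clock display

# Local semantics: the wall-clock reading is fixed, but without a zone
# the value does not identify any single instant.
local = datetime(2019, 7, 9, 15, 0)    # naive: no zone attached
assert local.tzinfo is None
```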
> > > >> >
> > > >> > Br,
> > > >> >
> > > >> > Zoltan
> > > >> >
> > > >> > On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <we...@gmail.com>
> > > >> wrote:
> > > >> >
> > > >> > > Thanks Zoltan.
> > > >> > >
> > > >> > > This is definitely a tricky issue.
> > > >> > >
> > > >> > > Spark's application of localtime semantics to timestamp data has
> > been
> > > >> > > a source of issues for many people. Personally I don't find that
> > > >> > > behavior to be particularly helpful since depending on the session
> > > >> > > time zone, you will get different results on data not marked as
> > > >> > > UTC-normalized.
> > > >> > >
> > > >> > > In pandas, in contrast, we apply UTC semantics to
> > > >> > > naive/not-explicitly-normalized data so at least the code produces
> > > >> > > deterministic results on all environments. That seems strictly
> > better
> > > >> > > to me -- if you want a localized interpretation of naive data, you
> > > >> > > must opt into this rather than having it automatically selected
> > based
> > > >> > > on your locale. The instances of people shooting their toes off
> > due to
> > > >> > > time zones are practically non-existent, whereas I'm hearing about
> > > >> > > Spark gotchas all the time.
> > > >> > >
> > > >> > > On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi
> > <zi@cloudera.com.invalid
> > > >> >
> > > >> > > wrote:
> > > >> > > >
> > > >> > > > Hi Wes,
> > > >> > > >
> > > >> > > > The rules for TIMESTAMP forward-compatibility were created
> > based on
> > > >> the
> > > >> > > > assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only
> > > >> been used
> > > >> > > > in the instant (aka UTC-normalized) semantics so far. This
> > > >> assumption was
> > > >> > > > supported by two sources:
> > > >> > > >
> > > >> > > > 1. The specification: parquet-format defined TIMESTAMP_MILLIS
> > and
> > > >> > > > TIMESTAMP_MICROS as the number of milli/microseconds elapsed
> > since
> > > >> the
> > > >> > > Unix
> > > >> > > > epoch, an instant specified in UTC, from which it follows that
> > they
> > > >> have
> > > >> > > > instant semantics (because timestamps of local semantics do not
> > > >> > > correspond
> > > >> > > > to a single instant).
> > > >> > > >
> > > >> > > > 2. Anecdotal knowledge: We were not aware of any software
> > component
> > > >> that
> > > >> > > > used these types differently from the specification.
> > > >> > > >
> > > >> > > > Based on your e-mail, we were wrong on #2.
> > > >> > > >
> > > >> > > > From this false premise it followed that TIMESTAMPs with local
> > > >> semantics
> > > >> > > > were a new type and did not need to be annotated with the old
> > types
> > > >> to
> > > >> > > > maintain compatibility. In fact, annotating them with the old
> > types
> > > >> were
> > > >> > > > considered to be harmful, since it would have misled older
> > readers
> > > >> into
> > > >> > > > thinking that they can read TIMESTAMPs with local semantics,
> > when in
> > > >> > > > reality they would have misinterpreted them as TIMESTAMPs with
> > > >> instant
> > > >> > > > semantics. This would have led to a difference of several
> > hours,
> > > >> > > > corresponding to the time zone offset.
> > > >> > > >
> > > >> > > > In the light of your e-mail, this misinterpretation of
> > timestamps
> > > >> may
> > > >> > > > already be happening, since if Arrow annotates local timestamps
> > with
> > > >> > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably
> > misinterprets
> > > >> them
> > > >> > > as
> > > >> > > > timestamps with instant semantics, leading to a difference of
> > > >> several
> > > >> > > hours.
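[Editor's note: the several-hours shift Zoltan describes can be reproduced with stdlib Python (no Parquet involved) — an illustrative aside, not part of the original mail:]

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo

# A writer using *local* semantics stores the wall-clock reading
# 2019-07-09 15:00 as microseconds since the epoch, i.e. as if it
# were UTC (the old metadata has nowhere to say otherwise).
wall_clock = datetime(2019, 7, 9, 15, 0)
stored_micros = int(wall_clock.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

# A reader that takes TIMESTAMP_MICROS to mean instant semantics
# treats the value as UTC and adjusts it to the session time zone.
as_instant = datetime.fromtimestamp(stored_micros / 1_000_000, tz=timezone.utc)
displayed = as_instant.astimezone(ZoneInfo("Europe/Budapest"))

# The displayed value is shifted by the zone offset (+2h for CEST here).
assert displayed.hour - wall_clock.hour == 2
```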
> > > >> > > >
> > > >> > > > Based on this, I think it would make sense from Arrow's point of
> > > >> view to
> > > >> > > > annotate both semantics with the old types, since that is its
> > > >> historical
> > > >> > > > behaviour and keeping it up is needed for maintaining
> > compatibility.
> > > >> I'm
> > > >> > > not
> > > >> > > > so sure about the Java library though, since as far as I know,
> > these
> > > >> > > types
> > > >> > > > were never used in the local sense there (although I may be
> > wrong
> > > >> again).
> > > >> > > > Were we to decide that Arrow and parquet-mr should behave
> > > >> differently in
> > > >> > > > this aspect though, it may be tricky to convey this distinction
> > in
> > > >> the
> > > >> > > > specification. I would be interested in hearing your and other
> > > >> > > developers'
> > > >> > > > opinions on this.
> > > >> > > >
> > > >> > > > Thanks,
> > > >> > > >
> > > >> > > > Zoltan
> > > >> > > >
> > > >> > > > On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <
> > wesmckinn@gmail.com>
> > > >> wrote:
> > > >> > > >
> > > >> > > > > hi folks,
> > > >> > > > >
> > > >> > > > > We have just recently implemented the new LogicalType unions
> > in
> > > >> the
> > > >> > > > > Parquet C++ library and we have run into a forward
> > compatibility
> > > >> > > > > problem with reader versions prior to this implementation.
> > > >> > > > >
> > > >> > > > > To recap the issue, prior to the introduction of LogicalType,
> > the
> > > >> > > > > Parquet format had no explicit notion of time zones or UTC
> > > >> > > > > normalization. The new TimestampType provides a flag to
> > indicate
> > > >> > > > > UTC-normalization
> > > >> > > > >
> > > >> > > > > struct TimestampType {
> > > >> > > > > 1: required bool isAdjustedToUTC
> > > >> > > > > 2: required TimeUnit unit
> > > >> > > > > }
> > > >> > > > >
> > > >> > > > > When using this new type, the ConvertedType field must also be
> > > >> set for
> > > >> > > > > forward compatibility (so that old readers can still
> > understand
> > > >> the
> > > >> > > > > data), but parquet.thrift says
> > > >> > > > >
> > > >> > > > > // use ConvertedType TIMESTAMP_MICROS for
> > > >> TIMESTAMP(isAdjustedToUTC =
> > > >> > > > > true, unit = MICROS)
> > > >> > > > > // use ConvertedType TIMESTAMP_MILLIS for
> > > >> TIMESTAMP(isAdjustedToUTC =
> > > >> > > > > true, unit = MILLIS)
> > > >> > > > > 8: TimestampType TIMESTAMP
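[Editor's note: the fix under discussion — emitting a legacy ConvertedType for *every* TIMESTAMP, not only UTC-adjusted ones — can be sketched as below. The enum names are illustrative, not the identifiers generated from parquet.thrift:]

```python
from enum import Enum

class TimeUnit(Enum):
    MILLIS = 1
    MICROS = 2

class ConvertedType(Enum):
    TIMESTAMP_MILLIS = 9
    TIMESTAMP_MICROS = 10

def converted_type_for(is_adjusted_to_utc: bool, unit: TimeUnit) -> ConvertedType:
    # Proposed behavior: ignore isAdjustedToUTC when choosing the
    # forward-compatibility annotation, so pre-LogicalType readers
    # can still interpret the column.
    return (ConvertedType.TIMESTAMP_MILLIS if unit is TimeUnit.MILLIS
            else ConvertedType.TIMESTAMP_MICROS)

# Both a UTC-normalized and a local timestamp get annotated:
assert converted_type_for(True, TimeUnit.MICROS) is ConvertedType.TIMESTAMP_MICROS
assert converted_type_for(False, TimeUnit.MICROS) is ConvertedType.TIMESTAMP_MICROS
```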
> > > >> > > > >
> > > >> > > > > In Apache Arrow, we have 2 varieties of timestamps:
> > > >> > > > >
> > > >> > > > > * Timestamp without time zone (no UTC normalization indicated)
> > > >> > > > > * Timestamp with time zone (values UTC-normalized)
> > > >> > > > >
> > > >> > > > > Prior to the introduction of LogicalType, we would set either
> > > >> > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditionally, regardless of UTC
> > > >> > > > > normalization. So when reading the data back, any notion of
> > > >> having had
> > > >> > > > > a time zone is lost (it could be stored in schema metadata if
> > > >> > > > > desired).
> > > >> > > > >
> > > >> > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_
> > when
> > > >> > > > > isAdjustedToUTC is true creates a forward compatibility break
> > in
> > > >> this
> > > >> > > > > regard. This was reported to us shortly after releasing Apache
> > > >> Arrow
> > > >> > > > > 0.14.0:
> > > >> > > > >
> > > >> > > > > https://issues.apache.org/jira/browse/ARROW-5878
> > > >> > > > >
> > > >> > > > > We are discussing setting the ConvertedType unconditionally in
> > > >> > > > >
> > > >> > > > > https://github.com/apache/arrow/pull/4825
> > > >> > > > >
> > > >> > > > > This might need to be a setting that is toggled when data is
> > > >> coming
> > > >> > > > > from Arrow, but I wonder if the text in parquet.thrift is the
> > > >> intended
> > > >> > > > > forward compatibility interpretation, and if not should we
> > amend.
> > > >> > > > >
> > > >> > > > > Thanks,
> > > >> > > > > Wes
> > > >> > > > >
> > > >> > >
> > > >>
> > > >
> >

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Wes,

Do you mean that the new logical types have already been released in 0.14.0
and a 0.14.1 is needed ASAP to fix this regression?

Thanks,

Zoltan


Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Wes McKinney <we...@gmail.com>.
hi Zoltan -- given the raging fire that is 0.14.0 as a result of these
issues and others, we need to make a new release within the next 7-10
days. We can point you to nightly Python builds to make testing for
you easier so you don't have to build the project yourself.

- Wes


Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

Oh, and one more thing: Before releasing the next Arrow version
incorporating the new logical types, we should definitely test that their
behaviour matches that of parquet-mr. When is the next release planned to
come out?

Br,

Zoltan

>> > > > >
>> > > > > To recap the issue, prior to the introduction of LogicalType, the
>> > > > > Parquet format had no explicit notion of time zones or UTC
>> > > > > normalization. The new TimestampType provides a flag to indicate
>> > > > > UTC-normalization
>> > > > >
>> > > > > struct TimestampType {
>> > > > > 1: required bool isAdjustedToUTC
>> > > > > 2: required TimeUnit unit
>> > > > > }
>> > > > >
>> > > > > When using this new type, the ConvertedType field must also be
>> set for
>> > > > > forward compatibility (so that old readers can still understand
>> the
>> > > > > data), but parquet.thrift says
>> > > > >
>> > > > > // use ConvertedType TIMESTAMP_MICROS for
>> TIMESTAMP(isAdjustedToUTC =
>> > > > > true, unit = MICROS)
>> > > > > // use ConvertedType TIMESTAMP_MILLIS for
>> TIMESTAMP(isAdjustedToUTC =
>> > > > > true, unit = MILLIS)
>> > > > > 8: TimestampType TIMESTAMP
>> > > > >
>> > > > > In Apache Arrow, we have 2 varieties of timestamps:
>> > > > >
>> > > > > * Timestamp without time zone (no UTC normalization indicated)
>> > > > > * Timestamp with time zone (values UTC-normalized)
>> > > > >
>> > > > > Prior to the introduction of LogicalType, we would set either
>> > > > > TIMESTAMP_MILLIS or TIMESTAMP_MICROS unconditional on UTC
>> > > > > normalization. So when reading the data back, any notion of
>> having had
>> > > > > a time zone is lost (it could be stored in schema metadata if
>> > > > > desired).
>> > > > >
>> > > > > I believe that setting the TIMESTAMP_* ConvertedType _only_ when
>> > > > > isAdjustedToUTC is true creates a forward compatibility break in
>> this
>> > > > > regard. This was reported to us shortly after releasing Apache
>> Arrow
>> > > > > 0.14.0:
>> > > > >
>> > > > > https://issues.apache.org/jira/browse/ARROW-5878
>> > > > >
>> > > > > We are discussing setting the ConvertedType unconditionally in
>> > > > >
>> > > > > https://github.com/apache/arrow/pull/4825
>> > > > >
>> > > > > This might need to be a setting that is toggled when data is
>> coming
>> > > > > from Arrow, but I wonder if the text in parquet.thrift is the
>> intended
>> > > > > forward compatibility interpretation, and if not should we amend.
>> > > > >
>> > > > > Thanks,
>> > > > > Wes
>> > > > >
>> > >
>>
>

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Wes,

Yes, I agree that we should do that, but then we have the problem of what to
do in the other direction: when we use the new logical-types API to read a
TIMESTAMP_MILLIS or TIMESTAMP_MICROS, how should we set the UTC-normalized
flag? Tim has started a discussion about that, suggesting the use of three
states; I have just replied there.
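
For illustration, the three-state idea could look like the following Python
sketch (names like infer_utc_flag are made up for this e-mail, not an actual
Parquet reader API):

```python
from enum import Enum

class UtcNormalized(Enum):
    TRUE = "true"
    FALSE = "false"
    UNKNOWN = "unknown"   # legacy annotation carries no time zone information

def infer_utc_flag(logical_type, converted_type):
    """Decide the UTC-normalized flag when reading a timestamp column."""
    if logical_type is not None:
        # New-style metadata is explicit about normalization.
        return UtcNormalized.TRUE if logical_type["isAdjustedToUTC"] else UtcNormalized.FALSE
    if converted_type in ("TIMESTAMP_MILLIS", "TIMESTAMP_MICROS"):
        # Old-style annotation only: the writer's intent cannot be recovered.
        return UtcNormalized.UNKNOWN
    return UtcNormalized.UNKNOWN

print(infer_utc_flag({"isAdjustedToUTC": True}, "TIMESTAMP_MICROS"))  # UtcNormalized.TRUE
print(infer_utc_flag(None, "TIMESTAMP_MILLIS"))                       # UtcNormalized.UNKNOWN
```

The point of the third state is that a consumer can then choose its own
policy (treat unknown as instant, as local, or raise) instead of the library
silently picking one.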

Br,

Zoltan

On Wed, Jul 10, 2019 at 3:52 PM Wes McKinney <we...@gmail.com> wrote:


Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Wes McKinney <we...@gmail.com>.
Thanks for the comments.

So in summary, I think we need to set the TIMESTAMP_* converted
types to maintain forward compatibility and to stay consistent with what
we were doing in the C++ library prior to the introduction of the
LogicalType metadata.

On Wed, Jul 10, 2019 at 8:20 AM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Wes,

Both semantics are deterministic in one aspect and nondeterministic
in another. Timestamps with instant semantics will always refer to the same
instant, but their user-facing representation (how they get displayed)
depends on the user's time zone. Timestamps with local semantics always have
the same user-facing representation, but the instant they refer to is
undefined (or ambiguous, depending on your point of view).

My understanding is that Spark uses instant semantics, i.e., timestamps are
stored normalized to UTC and are displayed adjusted to the user's local
time zone.
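
To make the two semantics concrete, here is a small stdlib-only Python
sketch (fixed offsets for brevity; nothing Parquet-specific is assumed):

```python
from datetime import datetime, timezone, timedelta

# Instant semantics: one UTC-normalized value, zone-dependent display.
instant = datetime(2019, 7, 9, 15, 38, 51, tzinfo=timezone.utc)
in_utc = instant.astimezone(timezone(timedelta(hours=0)))
in_pst = instant.astimezone(timezone(timedelta(hours=-8)))
assert in_utc == in_pst            # same instant...
assert in_utc.hour != in_pst.hour  # ...different user-facing hour

# Local semantics: fixed wall-clock reading, ambiguous instant.
local = datetime(2019, 7, 9, 15, 38, 51)  # naive: no zone attached
as_utc = local.replace(tzinfo=timezone.utc)
as_pst = local.replace(tzinfo=timezone(timedelta(hours=-8)))
assert as_utc != as_pst  # same representation maps to different instants
```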

Br,

Zoltan

On Tue, Jul 9, 2019 at 7:04 PM Wes McKinney <we...@gmail.com> wrote:


Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Wes McKinney <we...@gmail.com>.
Thanks Zoltan.

This is definitely a tricky issue.

Spark's application of localtime semantics to timestamp data has been
a source of issues for many people. Personally I don't find that
behavior to be particularly helpful, since depending on the session
time zone you will get different results on data not marked as
UTC-normalized.

In pandas, in contrast, we apply UTC semantics to
naive/not-explicitly-normalized data, so at least the code produces
deterministic results in all environments. That seems strictly better
to me -- if you want a localized interpretation of naive data, you
must opt into it rather than having it selected automatically based
on your locale. The instances of people shooting themselves in the foot
due to time zones are practically non-existent, whereas I'm hearing about
Spark gotchas all the time.
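
A stdlib-only sketch of that difference (the reader functions here are
hypothetical, just to show two interpretations of the same stored value):

```python
from datetime import datetime, timezone, timedelta

stored = 1562686731000000  # int64 micros since the epoch, as written to Parquet

def read_utc_semantics(us):
    # Deterministic: the same result in every environment.
    return datetime.fromtimestamp(us / 1e6, tz=timezone.utc)

def read_localtime_semantics(us, session_zone):
    # Spark-style: the answer depends on the session time zone.
    return datetime.fromtimestamp(us / 1e6, tz=session_zone).replace(tzinfo=None)

utc_val = read_utc_semantics(stored)
tokyo = read_localtime_semantics(stored, timezone(timedelta(hours=9)))
ny = read_localtime_semantics(stored, timezone(timedelta(hours=-4)))
assert tokyo - ny == timedelta(hours=13)  # same bytes, 13-hour disagreement
```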

On Tue, Jul 9, 2019 at 11:34 AM Zoltan Ivanfi <zi...@cloudera.com.invalid> wrote:

Re: Forward compatibility issues with TIMESTAMP_MILLIS/MICROS ConvertedType

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi Wes,

The rules for TIMESTAMP forward-compatibility were created based on the
assumption that TIMESTAMP_MILLIS and TIMESTAMP_MICROS have only been used
in the instant aka. UTC-normalized semantics so far. This assumption was
supported by two sources:

1. The specification: parquet-format defined TIMESTAMP_MILLIS and
TIMESTAMP_MICROS as the number of milli/microseconds elapsed since the Unix
epoch, an instant specified in UTC, from which it follows that they have
instant semantics (because timestamps of local semantics do not correspond
to a single instant).

2. Anecdotal knowledge: We were not aware of any software component that
used these types differently from the specification.

Based on your e-mail, we were wrong on #2.

From this false premise it followed that TIMESTAMPs with local semantics
were a new type and did not need to be annotated with the old types to
maintain compatibility. In fact, annotating them with the old types was
considered harmful, since it would have misled older readers into
thinking that they could read TIMESTAMPs with local semantics, when in
reality they would have misinterpreted them as TIMESTAMPs with instant
semantics. This would have led to a difference of several hours,
corresponding to the time zone offset.

In the light of your e-mail, this misinterpretation of timestamps may
already be happening: if Arrow annotates local timestamps with
TIMESTAMP_MILLIS or TIMESTAMP_MICROS, Spark probably misinterprets them
as timestamps with instant semantics, leading to a difference of several
hours.
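
As a concrete sketch of that failure mode (plain Python; the UTC-5 zone
and the specific wall-clock value are arbitrary choices for
illustration):

```python
from datetime import datetime, timezone, timedelta

# A writer stores the local wall-clock time 2019-07-09 10:00:00 as
# micros-since-epoch of those raw field values (no UTC normalization).
wall_clock = datetime(2019, 7, 9, 10, 0, 0)
stored_micros = int(wall_clock.replace(tzinfo=timezone.utc).timestamp() * 1_000_000)

# A reader assuming instant semantics treats the value as UTC-normalized...
as_instant = datetime.fromtimestamp(stored_micros / 1_000_000, tz=timezone.utc)

# ...and renders it in the session time zone (here UTC-5), shifting the
# wall-clock reading by the zone offset.
session_zone = timezone(timedelta(hours=-5))
print(as_instant.astimezone(session_zone))  # 2019-07-09 05:00:00-05:00
# The writer meant 10:00; the reader shows 05:00, five hours off.
```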

Based on this, I think it would make sense from Arrow's point of view to
annotate both semantics with the old types, since that is its historical
behaviour and keeping it is needed for maintaining compatibility. I'm
not so sure about the Java library, though, since as far as I know these
types were never used in the local sense there (although I may be wrong
again). Were we to decide that Arrow and parquet-mr should behave
differently in this aspect, it may be tricky to convey this distinction
in the specification. I would be interested in hearing your and other
developers' opinions on this.
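
For clarity, the two annotation policies under discussion could be
sketched like this (hypothetical helper functions for illustration, not
any library's actual API):

```python
def converted_type_per_spec(is_adjusted_to_utc: bool, unit: str):
    """parquet.thrift's stated rule: annotate only UTC-normalized timestamps."""
    if is_adjusted_to_utc and unit in ("MILLIS", "MICROS"):
        return "TIMESTAMP_" + unit
    return None  # older readers see no ConvertedType at all

def converted_type_unconditional(is_adjusted_to_utc: bool, unit: str):
    """Arrow's historical behaviour: annotate regardless of normalization."""
    if unit in ("MILLIS", "MICROS"):
        return "TIMESTAMP_" + unit
    return None

# A local-semantics (isAdjustedToUTC = false) microsecond timestamp:
print(converted_type_per_spec(False, "MICROS"))       # None
print(converted_type_unconditional(False, "MICROS"))  # TIMESTAMP_MICROS
```

Under the first policy, older readers cannot read local-semantics
timestamps at all; under the second, they read them but may
misinterpret them as instants, which is the trade-off at issue here.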

Thanks,

Zoltan

On Tue, Jul 9, 2019 at 5:39 PM Wes McKinney <we...@gmail.com> wrote:

> hi folks,
>
> We have just recently implemented the new LogicalType unions in the
> Parquet C++ library and we have run into a forward compatibility
> problem with reader versions prior to this implementation.
>
> To recap the issue, prior to the introduction of LogicalType, the
> Parquet format had no explicit notion of time zones or UTC
> normalization. The new TimestampType provides a flag to indicate
> UTC-normalization
>
> struct TimestampType {
> 1: required bool isAdjustedToUTC
> 2: required TimeUnit unit
> }
>
> When using this new type, the ConvertedType field must also be set for
> forward compatibility (so that old readers can still understand the
> data), but parquet.thrift says
>
> // use ConvertedType TIMESTAMP_MICROS for TIMESTAMP(isAdjustedToUTC =
> true, unit = MICROS)
> // use ConvertedType TIMESTAMP_MILLIS for TIMESTAMP(isAdjustedToUTC =
> true, unit = MILLIS)
> 8: TimestampType TIMESTAMP
>
> In Apache Arrow, we have 2 varieties of timestamps:
>
> * Timestamp without time zone (no UTC normalization indicated)
> * Timestamp with time zone (values UTC-normalized)
>
> Prior to the introduction of LogicalType, we would set either
> TIMESTAMP_MILLIS or TIMESTAMP_MICROS regardless of UTC
> normalization. So when reading the data back, any notion of having had
> a time zone is lost (it could be stored in schema metadata if
> desired).
>
> I believe that setting the TIMESTAMP_* ConvertedType _only_ when
> isAdjustedToUTC is true creates a forward compatibility break in this
> regard. This was reported to us shortly after releasing Apache Arrow
> 0.14.0:
>
> https://issues.apache.org/jira/browse/ARROW-5878
>
> We are discussing setting the ConvertedType unconditionally in
>
> https://github.com/apache/arrow/pull/4825
>
> This might need to be a setting that is toggled when data is coming
> from Arrow, but I wonder if the text in parquet.thrift is the intended
> forward compatibility interpretation, and if not, whether we should amend it.
>
> Thanks,
> Wes
>