Posted to dev@impala.apache.org by Zoltan Ivanfi <zi...@cloudera.com> on 2019/02/20 14:56:57 UTC

Moving forward with the timestamp proposal

Hi,

Last December we shared a timestamp harmonization proposal
<https://goo.gl/VV88c5> with the Hive, Spark and Impala communities. This
was followed by an extensive discussion in January that led to various
updates and improvements to the proposal, as well as the creation of a new
document for file format components. February has been quiet regarding this
topic and the latest revision of the proposal has been stable in recent
weeks.

In short, the following is being proposed (please see the document for
details):

   - The TIMESTAMP WITHOUT TIME ZONE type should have LocalDateTime
   semantics.
   - The TIMESTAMP WITH LOCAL TIME ZONE type should have Instant semantics.
   - The TIMESTAMP WITH TIME ZONE type should have OffsetDateTime semantics.

This proposal is in accordance with the SQL standard and many major DB
engines.
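
To make the three semantics more concrete, here is a minimal Java sketch
using the java.time classes that the semantics are named after. It is only an
illustration and is not taken from the proposal document: the class name, the
method names and the sample values are made up, and the session time zone is
modeled as a fixed UTC offset to keep the example self-contained.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.OffsetDateTime;
import java.time.ZoneId;
import java.time.ZoneOffset;

// Hypothetical illustration only; none of these names come from the proposal.
final class TimestampSemanticsSketch {

    // TIMESTAMP WITHOUT TIME ZONE (LocalDateTime semantics):
    // a wall-clock reading that is shown unchanged in every session time zone.
    static String renderWithoutTimeZone(LocalDateTime value, ZoneId sessionZone) {
        return value.toString(); // sessionZone is intentionally ignored
    }

    // TIMESTAMP WITH LOCAL TIME ZONE (Instant semantics):
    // a point on the time line, shown in the time zone of the current session.
    static String renderWithLocalTimeZone(Instant value, ZoneId sessionZone) {
        return LocalDateTime.ofInstant(value, sessionZone).toString();
    }

    // TIMESTAMP WITH TIME ZONE (OffsetDateTime semantics):
    // a point on the time line together with the offset it was recorded with.
    static String renderWithTimeZone(OffsetDateTime value, ZoneId sessionZone) {
        return value.toString(); // the stored offset is kept, not the session's
    }

    public static void main(String[] args) {
        LocalDateTime local = LocalDateTime.of(2019, 2, 20, 14, 56, 57);
        Instant instant = Instant.parse("2019-02-20T14:56:57Z");
        OffsetDateTime offset = OffsetDateTime.of(local, ZoneOffset.ofHours(1));

        // Two sessions with different (fixed-offset) time zones.
        ZoneId[] sessions = {ZoneOffset.ofHours(1), ZoneOffset.ofHours(-8)};
        for (ZoneId session : sessions) {
            System.out.println("session time zone " + session);
            System.out.println("  WITHOUT TIME ZONE:    "
                    + renderWithoutTimeZone(local, session));
            System.out.println("  WITH LOCAL TIME ZONE: "
                    + renderWithLocalTimeZone(instant, session));
            System.out.println("  WITH TIME ZONE:       "
                    + renderWithTimeZone(offset, session));
        }
    }
}
```

In this sketch only the WITH LOCAL TIME ZONE value is displayed differently
in the two sessions; the other two render the same text regardless of the
session time zone.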

Based on the feedback we got, I believe that the latest revision of the
proposal addresses the needs of all affected components. I would therefore
like to move forward and create JIRAs and/or roadmap documentation pages
for the desired semantics of the different SQL types according to the
proposal.

Please let me know if you have any remaining concerns about the proposal or
about the course of action outlined above.

Thanks,

Zoltan

Re: Moving forward with the timestamp proposal

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

We can add these new SQL types by adding support to the file formats first.
But the most important and immediate goal is reserving these types for
their desired meaning and that can already be done without such support.

Of course, eventually the new types need to be implemented as well, and for
that we would need support from the file format components. I have already
contacted the Avro, ORC, Parquet, Arrow, Kudu, Iceberg and CarbonData
communities to let them know of this new requirement. Parquet, Arrow and
Iceberg already have semantics metadata that supports LocalDateTime and
Instant semantics, and we plan to actively drive their addition to Avro and
would also be happy to contribute to ORC. Regarding the OffsetDateTime
semantics, I am not aware of any file format that already supports it
natively.

Alternatively, we could also implement the new types without such support,
in which case the semantics metadata could not be deduced from the files
themselves but would have to come directly from the user (at least
initially). This will be the case for text files, for example, where no
metadata can be stored in the files. I think we should reserve this approach
for file formats where having proper metadata in the files is impossible
(text files) or where the developers of a file format component prefer not
to add new types for this purpose (unlikely but possible).
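
As a rough sketch of what "semantics coming from the user" could look like
for a text file, the following Java snippet parses the same text field under
a user-declared semantics. Everything in it is an assumption made for
illustration: the class, enum and method names are hypothetical, the text
layout is assumed to be "yyyy-MM-dd HH:mm:ss", and for Instant semantics the
text is assumed to have been written normalized to UTC.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Hypothetical illustration only; not an API of Hive, Spark or Impala.
final class TextTimestampSketch {

    // The semantics cannot be read from a text file, so the user declares them.
    enum DeclaredSemantics { LOCAL_DATE_TIME, INSTANT }

    // Assumed text layout of the timestamp field.
    private static final DateTimeFormatter FORMAT =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss");

    // Returns a LocalDateTime or an Instant, depending on the declaration.
    static Object readField(String field, DeclaredSemantics declared) {
        LocalDateTime parsed = LocalDateTime.parse(field, FORMAT);
        switch (declared) {
            case LOCAL_DATE_TIME:
                // Keep the wall-clock reading exactly as written.
                return parsed;
            case INSTANT:
                // Assumption: the writer normalized the text to UTC.
                return parsed.toInstant(ZoneOffset.UTC);
            default:
                throw new IllegalArgumentException("Unknown semantics: " + declared);
        }
    }

    public static void main(String[] args) {
        String field = "2019-02-20 14:56:57";
        System.out.println(readField(field, DeclaredSemantics.LOCAL_DATE_TIME));
        System.out.println(readField(field, DeclaredSemantics.INSTANT));
    }
}
```

The same bytes in the file yield values of different types depending on the
declaration, which is why the metadata is better kept in the file itself
whenever the format allows it.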

Br,

Zoltan

On Thu, Feb 21, 2019 at 8:32 AM Wenchen Fan <cl...@gmail.com> wrote:

> I think this is the right direction to go, but I'm wondering how Spark can
> support these new types if the underlying data sources (like Parquet files)
> do not support them yet.
>
> I took a quick look at the new doc for file formats, but I'm not sure what
> the proposal is. Are we going to implement these new types in Parquet/ORC
> first? Or are we going to use low-level physical types directly and add
> Spark-specific metadata to Parquet/ORC files?
>
> On Wed, Feb 20, 2019 at 10:57 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
> wrote:
>
> > Hi,
> >
> > Last December we shared a timestamp harmonization proposal
> > <https://goo.gl/VV88c5> with the Hive, Spark and Impala communities. This
> > was followed by an extensive discussion in January that led to various
> > updates and improvements to the proposal, as well as the creation of a new
> > document for file format components. February has been quiet regarding
> > this topic and the latest revision of the proposal has been stable in
> > recent weeks.
> >
> > In short, the following is being proposed (please see the document for
> > details):
> >
> >    - The TIMESTAMP WITHOUT TIME ZONE type should have LocalDateTime
> >    semantics.
> >    - The TIMESTAMP WITH LOCAL TIME ZONE type should have Instant
> >    semantics.
> >    - The TIMESTAMP WITH TIME ZONE type should have OffsetDateTime
> >    semantics.
> >
> > This proposal is in accordance with the SQL standard and many major DB
> > engines.
> >
> > Based on the feedback we got, I believe that the latest revision of the
> > proposal addresses the needs of all affected components. I would therefore
> > like to move forward and create JIRAs and/or roadmap documentation pages
> > for the desired semantics of the different SQL types according to the
> > proposal.
> >
> > Please let me know if you have any remaining concerns about the proposal
> > or about the course of action outlined above.
> >
> > Thanks,
> >
> > Zoltan
> >
>

Re: Moving forward with the timestamp proposal

Posted by Wenchen Fan <cl...@gmail.com>.
I think this is the right direction to go, but I'm wondering how Spark can
support these new types if the underlying data sources (like Parquet files)
do not support them yet.

I took a quick look at the new doc for file formats, but I'm not sure what
the proposal is. Are we going to implement these new types in Parquet/ORC
first? Or are we going to use low-level physical types directly and add
Spark-specific metadata to Parquet/ORC files?

On Wed, Feb 20, 2019 at 10:57 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> Last December we shared a timestamp harmonization proposal
> <https://goo.gl/VV88c5> with the Hive, Spark and Impala communities. This
> was followed by an extensive discussion in January that led to various
> updates and improvements to the proposal, as well as the creation of a new
> document for file format components. February has been quiet regarding this
> topic and the latest revision of the proposal has been stable in recent
> weeks.
>
> In short, the following is being proposed (please see the document for
> details):
>
>    - The TIMESTAMP WITHOUT TIME ZONE type should have LocalDateTime
>    semantics.
>    - The TIMESTAMP WITH LOCAL TIME ZONE type should have Instant
>    semantics.
>    - The TIMESTAMP WITH TIME ZONE type should have OffsetDateTime
>    semantics.
>
> This proposal is in accordance with the SQL standard and many major DB
> engines.
>
> Based on the feedback we got, I believe that the latest revision of the
> proposal addresses the needs of all affected components. I would therefore
> like to move forward and create JIRAs and/or roadmap documentation pages
> for the desired semantics of the different SQL types according to the
> proposal.
>
> Please let me know if you have any remaining concerns about the proposal or
> about the course of action outlined above.
>
> Thanks,
>
> Zoltan
>

Re: Moving forward with the timestamp proposal

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Dear Hive Developers,

Following up on my previous mail, I have turned the timestamp proposal into
a brief design doc
<https://cwiki.apache.org/confluence/display/Hive/Different+TIMESTAMP+types>
on the Hive wiki and created the HIVE-21348
<https://issues.apache.org/jira/browse/HIVE-21348> JIRA ("Execute the
TIMESTAMP types roadmap") for tracking this effort with the following child
tasks:


   - HIVE-21002 <https://issues.apache.org/jira/browse/HIVE-21002> TIMESTAMP -
   Backwards incompatible change: Hive 3.1 reads back Avro and Parquet
   timestamps written by Hive 2.x incorrectly
      - HIVE-21290 <https://issues.apache.org/jira/browse/HIVE-21290>:
      Restore historical way of handling timestamps in Parquet while
      keeping the new semantics at the same time
      - HIVE-21291 <https://issues.apache.org/jira/browse/HIVE-21291>:
      Restore historical way of handling timestamps in Avro while keeping
      the new semantics at the same time
   - HIVE-21349 <https://issues.apache.org/jira/browse/HIVE-21349> TIMESTAMP
   WITHOUT TIME ZONE
      - HIVE-21359 <https://issues.apache.org/jira/browse/HIVE-21359> Parquet
      support for TIMESTAMP WITHOUT TIME ZONE
      - HIVE-21360 <https://issues.apache.org/jira/browse/HIVE-21360> Avro
      support for TIMESTAMP WITHOUT TIME ZONE
      - HIVE-21361 <https://issues.apache.org/jira/browse/HIVE-21361> ORC
      support for TIMESTAMP WITHOUT TIME ZONE
   - HIVE-21350 <https://issues.apache.org/jira/browse/HIVE-21350> TIMESTAMP
   WITH LOCAL TIME ZONE
      - HIVE-21353 <https://issues.apache.org/jira/browse/HIVE-21353> Use
      Instant instead of ZonedDateTime as the internal type for TIMESTAMP
      WITH LOCAL TIME ZONE
      - HIVE-21355 <https://issues.apache.org/jira/browse/HIVE-21355> Parquet
      support for TIMESTAMP WITH LOCAL TIME ZONE
      - HIVE-21357 <https://issues.apache.org/jira/browse/HIVE-21357> Avro
      support for TIMESTAMP WITH LOCAL TIME ZONE
      - HIVE-21358 <https://issues.apache.org/jira/browse/HIVE-21358> ORC
      support for TIMESTAMP WITH LOCAL TIME ZONE
   - HIVE-21351 <https://issues.apache.org/jira/browse/HIVE-21351> TIMESTAMP
   WITH TIME ZONE
      - No child JIRAs created yet.
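
For readers unfamiliar with the two java.time classes named in HIVE-21353,
the following small Java sketch shows one observable difference between
them. It is only an illustration with made-up class and method names; it
does not reflect Hive's internal code, and the reasoning behind the JIRA is
described in the JIRA itself rather than here.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.ZonedDateTime;

// Hypothetical illustration only; not Hive code.
final class InstantVersusZonedSketch {

    // ZonedDateTime.equals also compares the zone, not just the moment.
    static boolean equalAsZoned(ZonedDateTime a, ZonedDateTime b) {
        return a.equals(b);
    }

    // An Instant carries only the point on the time line.
    static boolean equalAsInstant(ZonedDateTime a, ZonedDateTime b) {
        return a.toInstant().equals(b.toInstant());
    }

    public static void main(String[] args) {
        Instant moment = Instant.parse("2019-02-20T14:56:57Z");

        // The same moment, attached to two different (fixed-offset) zones.
        ZonedDateTime plusOne = ZonedDateTime.ofInstant(moment, ZoneOffset.ofHours(1));
        ZonedDateTime minusEight = ZonedDateTime.ofInstant(moment, ZoneOffset.ofHours(-8));

        System.out.println("equal as ZonedDateTime: " + equalAsZoned(plusOne, minusEight));
        System.out.println("equal as Instant:       " + equalAsInstant(plusOne, minusEight));
    }
}
```

The two values denote the same moment, so they compare equal as Instants,
while the ZonedDateTime comparison also takes the attached zone into account.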


The most urgent tasks are probably the ones dealing with the backward
incompatible change introduced in Hive 3.1.

Please let me know if you have any questions or concerns.

Thanks,

Zoltan

On Wed, Feb 20, 2019 at 3:56 PM Zoltan Ivanfi <zi...@cloudera.com> wrote:

> Hi,
>
> Last December we shared a timestamp harmonization proposal
> <https://goo.gl/VV88c5> with the Hive, Spark and Impala communities. This
> was followed by an extensive discussion in January that led to various
> updates and improvements to the proposal, as well as the creation of a new
> document for file format components. February has been quiet regarding this
> topic and the latest revision of the proposal has been stable in recent
> weeks.
>
> In short, the following is being proposed (please see the document for
> details):
>
>    - The TIMESTAMP WITHOUT TIME ZONE type should have LocalDateTime
>    semantics.
>    - The TIMESTAMP WITH LOCAL TIME ZONE type should have Instant
>    semantics.
>    - The TIMESTAMP WITH TIME ZONE type should have OffsetDateTime
>    semantics.
>
> This proposal is in accordance with the SQL standard and many major DB
> engines.
>
> Based on the feedback we got, I believe that the latest revision of the
> proposal addresses the needs of all affected components. I would therefore
> like to move forward and create JIRAs and/or roadmap documentation pages
> for the desired semantics of the different SQL types according to the
> proposal.
>
> Please let me know if you have any remaining concerns about the proposal or
> about the course of action outlined above.
>
> Thanks,
>
> Zoltan
>