Posted to dev@avro.apache.org by Nandor Kollar <nk...@cloudera.com.INVALID> on 2019/04/17 07:43:55 UTC

[DISCUSS] Support additional timestamp semantic

Hi all,

There is an ongoing effort to harmonize timestamp types for various popular
SQL engines for Hadoop (see details here
<https://docs.google.com/document/d/1E-7miCh4qK6Mg54b-Dh5VOyhGX8V4xdMXKIHJL36a9U/edit#>).
As part of this effort, on-disk file formats should be able to support all
of these semantics. The Avro timestamp logical type supports only one semantic:
UTC-normalized. I put together a simple design doc and two POCs which
introduce additional local date/time semantics into Avro. Here is the
design doc:
https://docs.google.com/document/d/1rLmb4-6G8LHBwHUU2P_8gE1o3lvMV0gSitnmiXXmlWY/edit?usp=sharing
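
To make the distinction concrete, here is a minimal Python sketch (not taken
from the design doc or the POCs) contrasting the two semantics. It assumes
that a local timestamp is encoded as milliseconds counted from
1970-01-01T00:00 on the wall clock, with no time zone involved. The function
names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)  # a fixed instant
EPOCH_LOCAL = datetime(1970, 1, 1)                      # a zone-less wall-clock reading
ONE_MILLI = timedelta(milliseconds=1)


def instant_to_millis(dt):
    """UTC-normalized semantic: an aware datetime becomes millis since the UTC epoch."""
    if dt.tzinfo is None:
        raise ValueError("an instant needs a time zone or offset")
    return (dt - EPOCH_UTC) // ONE_MILLI


def local_to_millis(dt):
    """Local semantic (assumed encoding): a naive datetime becomes millis
    since 1970-01-01T00:00 on the same zone-less wall clock."""
    if dt.tzinfo is not None:
        raise ValueError("a local timestamp must not carry a time zone")
    return (dt - EPOCH_LOCAL) // ONE_MILLI


wall_clock = datetime(2019, 4, 17, 9, 44)            # what the clock on the wall shows
utc_plus_2 = timezone(timedelta(hours=2))
as_instant = wall_clock.replace(tzinfo=utc_plus_2)   # the same reading pinned to UTC+2

print(instant_to_millis(as_instant))  # depends on the writer's offset
print(local_to_millis(wall_clock))    # the same value wherever it is read
```

With these helpers, 09:44 at UTC+2 and 07:44 UTC encode to the same instant
value, while the local encoding of 09:44 differs from it by exactly the
two-hour offset.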

What are your thoughts on this? Please have a look at the POCs, and feel
free to comment on the design doc!

Thanks,
Nandor

Re: [DISCUSS] Support additional timestamp semantic

Posted by Nandor Kollar <nk...@cloudera.com.INVALID>.
Great, thanks Ryan and Zoltan for your feedback! As the next step, I will go
ahead and open a PR for review with option #3 soon.

On Thu, May 2, 2019 at 3:30 PM Zoltan Ivanfi <zi...@cloudera.com.invalid>
wrote:

> Hi,
>
> I also vote for the 3rd option (two new logical types:
> ‘local-timestamp-millis’ and ‘local-timestamp-micros’).
>
> Could you please create a JIRA for this task and send a link to it to
> this e-mail thread for everyone interested in the topic?
>
> Thanks,
>
> Zoltan

Re: [DISCUSS] Support additional timestamp semantic

Posted by Zoltan Ivanfi <zi...@cloudera.com.INVALID>.
Hi,

I also vote for the 3rd option (two new logical types:
‘local-timestamp-millis’ and ‘local-timestamp-micros’).
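
As an illustration of what option #3 could look like in a schema, here is a
small Python sketch that builds a record schema using the existing
'timestamp-millis' type next to the two proposed names. It assumes, by
analogy with the existing timestamp types, that the new logical types
annotate a long. The record name, field names and helper function are
hypothetical.

```python
import json


def timestamp_field(name, logical_type):
    """Build a field whose underlying Avro type is long, annotated with a logical type."""
    return {"name": name, "type": {"type": "long", "logicalType": logical_type}}


schema = {
    "type": "record",
    "name": "Event",  # hypothetical record name
    "fields": [
        timestamp_field("created_utc", "timestamp-millis"),             # existing type
        timestamp_field("created_local_ms", "local-timestamp-millis"),  # proposed
        timestamp_field("created_local_us", "local-timestamp-micros"),  # proposed
    ],
}

schema_json = json.dumps(schema, indent=2)
print(schema_json)
```

Because the new semantics live in separate type names, the existing
'timestamp-millis' field is written exactly as before.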

Could you please create a JIRA for this task and send a link to it to
this e-mail thread for everyone interested in the topic?

Thanks,

Zoltan

On Tue, Apr 23, 2019 at 1:49 PM Ryan Skraba <ry...@skraba.com> wrote:
>
> Hello!  I read the document with interest.  Very well-written and clean --
> I feel better equipped to explain the importance of the different flavours
> of date/time after reading it.
>
> I didn't go through the POC code in detail, but I did go through a bunch of
> our code to check how the proposed implementations would affect us (to
> provide a single, anecdotal data point).  We currently use Avro to
> represent hierarchical data internally as it passes through a transformation
> pipeline running on a cluster.  We mostly rely on generic data.  The input
> or output might already be in Avro (file or binary message format), but it
> isn't necessary.  We do the schema inference and conversion on non-Avro data
> when required.
>
> For us, it looks like both option#2 and option#3 should be more-or-less
> safe.  If we don't recognize a logical type, we'll just fall back on the
> underlying Avro type, and even propagate the unknown logical type down the
> pipeline if we can.
>
> Specifically, the bold proposal (option#2) for a new, unified logical type
> would mostly work without code modification on our part.  There's one or
> two places where we'd lose some helpful features where the semantic
> date/time type is taken into account, until we did the necessary rewrites.
> It wouldn't be a difficult task for us to bump to an Avro version that uses
> the new, unified logical type.
>
> Of course, the problem occurs when we're writing out data in Avro ... and
> the user has a next stage that doesn't understand the change.  Even if I
> appreciate the elegance of having a unified date/time logical type, it
> really seems like the more conservative third option (multiplying the
> number of logical types) is preferable.  Even if Avro ends up with a dozen
> logical types to describe the different flavours of date/time, this can
> eventually be unified in the language-specific API tools without breaking
> the schema specification.
>
> TL;DR: I read it, I appreciated it, I agree with your conclusions.
>
> Thanks again for the thorough and articulate work!  Ryan

Re: [DISCUSS] Support additional timestamp semantic

Posted by Ryan Skraba <ry...@skraba.com>.
Hello!  I read the document with interest.  Very well-written and clean --
I feel better equipped to explain the importance of the different flavours
of date/time after reading it.

I didn't go through the POC code in detail, but I did go through a bunch of
our code to check how the proposed implementations would affect us (to
provide a single, anecdotal data point).  We currently use Avro to
represent hierarchical data internally as it passes through a transformation
pipeline running on a cluster.  We mostly rely on generic data.  The input
or output might already be in Avro (file or binary message format), but it
isn't necessary.  We do the schema inference and conversion on non-Avro data
when required.

For us, it looks like both option#2 and option#3 should be more-or-less
safe.  If we don't recognize a logical type, we'll just fall back on the
underlying Avro type, and even propagate the unknown logical type down the
pipeline if we can.
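
The fallback described above can be sketched roughly as follows. This is a
simplified Python illustration, not the actual pipeline code: the conversion
table and the read_value helper are hypothetical, and schemas are plain
dictionaries rather than Avro library objects. The behaviour matches the Avro
specification's rule that readers ignore unknown logical types and use the
underlying type.

```python
from datetime import datetime, timedelta, timezone

EPOCH_UTC = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Conversions this (hypothetical) reader knows about.
KNOWN_CONVERSIONS = {
    "timestamp-millis": lambda raw: EPOCH_UTC + timedelta(milliseconds=raw),
}


def read_value(field_type, raw):
    """Convert raw using the field's logical type when it is recognized,
    otherwise fall back on the underlying Avro value unchanged."""
    convert = KNOWN_CONVERSIONS.get(field_type.get("logicalType"))
    return raw if convert is None else convert(raw)


known = {"type": "long", "logicalType": "timestamp-millis"}
unknown = {"type": "long", "logicalType": "local-timestamp-millis"}

print(read_value(known, 86_400_000))    # converted to an aware datetime
print(read_value(unknown, 86_400_000))  # left as the plain long
```

Since the schema dictionary itself is not modified, the unrecognized
logicalType annotation can still be passed on to the next stage.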

Specifically, the bold proposal (option#2) for a new, unified logical type
would mostly work without code modification on our part.  There's one or
two places where we'd lose some helpful features where the semantic
date/time type is taken into account, until we did the necessary rewrites.
It wouldn't be a difficult task for us to bump to an Avro version that uses
the new, unified logical type.

Of course, the problem occurs when we're writing out data in Avro ... and
the user has a next stage that doesn't understand the change.  Even if I
appreciate the elegance of having a unified date/time logical type, it
really seems like the more conservative third option (multiplying the
number of logical types) is preferable.  Even if Avro ends up with a dozen
logical types to describe the different flavours of date/time, this can
eventually be unified in the language-specific API tools without breaking
the schema specification.

TL;DR: I read it, I appreciated it, I agree with your conclusions.

Thanks again for the thorough and articulate work!  Ryan



On Wed, Apr 17, 2019 at 9:44 AM Nandor Kollar <nk...@cloudera.com.invalid>
wrote:

> Hi all,
>
> There is an ongoing effort to harmonize timestamp types for various popular
> SQL engines for Hadoop (see details here
> <
> https://docs.google.com/document/d/1E-7miCh4qK6Mg54b-Dh5VOyhGX8V4xdMXKIHJL36a9U/edit#
> >).
> As part of this effort, on-disk file formats should be able to support all
> of these semantics. The Avro timestamp logical type supports only one semantic:
> UTC-normalized. I put together a simple design doc and two POCs which
> introduce additional local date/time semantics into Avro. Here is the
> design doc:
>
> https://docs.google.com/document/d/1rLmb4-6G8LHBwHUU2P_8gE1o3lvMV0gSitnmiXXmlWY/edit?usp=sharing
>
> What are your thoughts on this? Please have a look at the POCs, and feel
> free to comment on the design doc!
>
> Thanks,
> Nandor
>