Posted to dev@arrow.apache.org by Li Jin <ic...@gmail.com> on 2023/03/09 17:43:03 UTC

Timestamp unit in Substrait and Arrow

Hi,

I recently came across some limitations in expressing timestamp types with
Substrait in the Acero Substrait consumer and am curious to hear what
people's thoughts are.

The particular issue is that when specifying a timestamp type in Substrait,
the unit is "microseconds" and there is no way to change that. When
integrating with Arrow, we often have timestamps from an internal system
that uses another unit, e.g., a flight service that returns timestamps in
nanoseconds. Interop with pandas is another gap, because pandas internally
uses nanoseconds.

As a result, we currently often need to convert between nanoseconds and
microseconds whenever a Substrait plan is involved. This feels to me like
something missing in Substrait, but I wonder what other people think.
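
To make the conversion concrete, here is a minimal sketch in plain Python of
what going between the two units involves. The helper names `ns_to_us` and
`us_to_ns` are hypothetical and are not part of Arrow, Acero, or Substrait;
the sketch only illustrates the two places where the round trip can go wrong
(lost sub-microsecond precision and int64 overflow).

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def ns_to_us(ns: int, allow_truncation: bool = False) -> int:
    """Convert nanoseconds since the epoch to microseconds since the epoch.

    Hypothetical helper. Sub-microsecond digits cannot be represented in
    microseconds, so by default a lossy conversion is rejected.
    """
    # divmod floors, so negative (pre-epoch) values round toward the
    # earlier instant rather than toward zero.
    us, remainder = divmod(ns, 1000)
    if remainder and not allow_truncation:
        raise ValueError(f"{ns} ns is not a whole number of microseconds")
    return us


def us_to_ns(us: int) -> int:
    """Convert microseconds since the epoch to nanoseconds since the epoch.

    Hypothetical helper. The result must still fit in a signed 64-bit
    integer, which is how both units are assumed to be stored here.
    """
    ns = us * 1000
    if not INT64_MIN <= ns <= INT64_MAX:
        raise OverflowError(f"{us} us does not fit in int64 nanoseconds")
    return ns


if __name__ == "__main__":
    stamp_ns = 1_678_383_783_000_001_000  # a whole number of microseconds
    stamp_us = ns_to_us(stamp_ns)
    print(stamp_us)
    print(us_to_ns(stamp_us) == stamp_ns)
```

The nanoseconds-to-microseconds direction is lossy unless the values happen
to be whole microseconds, while the other direction is exact but can
overflow for dates far from the epoch.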

(I am sending this to the Arrow mailing list because I know some people
here are pretty involved with Substrait, and I am more familiar with the
folks in the Arrow community, so I wanted to get some thoughts from the
people here.)

Li

Re: Timestamp unit in Substrait and Arrow

Posted by Li Jin <ic...@gmail.com>.
Thanks Weston for the insight. For the short term we are going to try to
unify the time unit to "microseconds" to be compatible with Substrait and
pay the cost of converting to nanoseconds (e.g., when passing to pandas)
when needed.

Longer term, I think option (3) is probably the most practical (although
perhaps not worthwhile if the performance cost of converting between
microseconds and nanoseconds isn't too bad in practice).



On Thu, Mar 9, 2023 at 1:36 PM Weston Pace <we...@gmail.com> wrote:

> The Substrait decision for microseconds was made because, at the time, the
> goal was to keep the type system simple and universal, and there were
> systems that didn't support ns (e.g. Iceberg, postgres, duckdb, velox).
>
> A few options (off the top of my head):
>
>  1. Attempt to get a nanoseconds timestamp type adopted in Substrait.
>
> I'm not sure how much enthusiasm there will be for this.  I think Acero is
> the only consumer that would take advantage of this.  Perhaps Ibis or
> Datafusion would have some interest.  It would require changing an old
> Substrait agreement around the rules for which data types to use.
>
> 2. Treat timestamp(ns) as a variation of timestamp(us).
>
> I'm listing this for thoroughness; however, I don't think we can do this.
> Substrait requires timestamps to be able to go out to the year 9999 and a
> 64-bit count of nanoseconds from the epoch cannot do this.
>
> 3. Treat timestamp(ns) as a user-defined type (from Substrait's
> perspective)
>
> This is probably the easiest approach in terms of consensus-building.  The
> Substrait consumer should already have the plumbing for this in
> src/arrow/engine/substrait/extension_types.h  I think getting Acero to work
> here will be pretty easy.  The trickier part might be adapting your
> producer (Ibis?)
>

Re: Timestamp unit in Substrait and Arrow

Posted by Weston Pace <we...@gmail.com>.
The Substrait decision for microseconds was made because, at the time, the
goal was to keep the type system simple and universal, and there were
systems that didn't support ns (e.g. Iceberg, Postgres, DuckDB, Velox).

A few options (off the top of my head):

1. Attempt to get a nanoseconds timestamp type adopted in Substrait.

I'm not sure how much enthusiasm there will be for this.  I think Acero is
the only consumer that would take advantage of this.  Perhaps Ibis or
DataFusion would have some interest.  It would require changing an old
Substrait agreement around the rules for which data types to use.

2. Treat timestamp(ns) as a variation of timestamp(us).

I'm listing this for thoroughness; however, I don't think we can do this.
Substrait requires timestamps to be able to go out to the year 9999 and a
64-bit count of nanoseconds from the epoch cannot do this.

3. Treat timestamp(ns) as a user-defined type (from Substrait's perspective)

This is probably the easiest approach in terms of consensus-building.  The
Substrait consumer should already have the plumbing for this in
src/arrow/engine/substrait/extension_types.h.  I think getting Acero to
work here will be pretty easy.  The trickier part might be adapting your
producer (Ibis?).
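
The range argument in option 2 can be checked with a little arithmetic. The
following is a minimal sketch using only the Python standard library; it
assumes timestamps are stored as signed 64-bit counts from the Unix epoch
(1970-01-01 UTC), and the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

INT64_MAX = 2**63 - 1
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)

# Latest instant an int64 count of nanoseconds can reach. datetime only has
# microsecond resolution, so the last three digits are dropped here.
latest_ns = EPOCH + timedelta(microseconds=INT64_MAX // 1000)

# Whole seconds from the epoch to the last second of the year 9999.
end_of_9999 = datetime(9999, 12, 31, 23, 59, 59, tzinfo=timezone.utc)
seconds_needed = (end_of_9999 - EPOCH) // timedelta(seconds=1)

# Does that instant fit in an int64 at each unit?
fits_as_ns = seconds_needed * 10**9 <= INT64_MAX
fits_as_us = seconds_needed * 10**6 <= INT64_MAX

print(latest_ns.isoformat())  # 2262-04-11T23:47:16.854775+00:00
print(fits_as_ns)             # False
print(fits_as_us)             # True
```

In other words, int64 nanoseconds run out in the year 2262, while int64
microseconds comfortably cover the year 9999.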
