Posted to dev@arrow.apache.org by David Boles <bi...@gmail.com> on 2019/10/09 19:27:12 UTC

Question about timestamps ...

The following code dies with pyarrow 0.14.2:

import pyarrow as pa
import pyarrow.parquet as pq

schema = pa.schema([('timestamp', pa.timestamp('ns', tz='UTC')),])
writer = pq.ParquetWriter('foo.parquet', schema, coerce_timestamps='ns')

ts_array = pa.array([ int(1234567893141) ], type=pa.timestamp('ns',
tz='UTC'))
table = pa.Table.from_arrays([ ts_array ], names=['timestamp'])

writer.write_table(table)
writer.close()

with the message:

ValueError: Invalid value for coerce_timestamps: ns

That appears to be because of this code in _parquet.pxi:

    cdef int _set_coerce_timestamps(
            self, ArrowWriterProperties.Builder* props) except -1:
        if self.coerce_timestamps == 'ms':
            props.coerce_timestamps(TimeUnit_MILLI)
        elif self.coerce_timestamps == 'us':
            props.coerce_timestamps(TimeUnit_MICRO)
        elif self.coerce_timestamps is not None:
            raise ValueError('Invalid value for coerce_timestamps: {0}'
                             .format(self.coerce_timestamps))

which restricts the choice to 'ms' or 'us', even though AFAICT everywhere
else also allows 'ns' (and there is a TimeUnit_NANO defined). Is this
intentional, or a bug?

Thanks,

 - db

Re: Question about timestamps ...

Posted by David Boles <bi...@gmail.com>.
Joris,

Thank you for the response. There's such a trail of stale information
online with respect to the overall picture that it wasn't clear what the
status was. For example, simple searches take you into the "INT96 is
deprecated therefore support for nanoseconds is as well" cul-de-sac. Absent
that confusing context, the existing error message is fine.

It's worth noting that accurate and precise timestamps down to ~0.1
nanosecond are widely available, with 0.02ns being available for just a few
thousand $US.

I'll stick with usec resolution for absolute time and just use an int64
field for my nanosecond data.
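
One possible reading of this workaround, as a minimal sketch in plain Python: keep a microsecond-resolution value for the timestamp column and carry the full nanosecond count alongside it as an ordinary integer. The helper name `split_nanoseconds` and the row keys `timestamp` and `timestamp_ns` are hypothetical and do not come from the thread.

```python
from datetime import datetime, timedelta, timezone

NS_PER_US = 1000
EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def split_nanoseconds(epoch_ns):
    """Split nanoseconds since the Unix epoch into (whole microseconds, leftover ns)."""
    return divmod(epoch_ns, NS_PER_US)


epoch_ns = 1234567893141
epoch_us, leftover_ns = split_nanoseconds(epoch_ns)

# Hypothetical row layout: a microsecond-resolution absolute time plus the
# original nanosecond count kept as a plain integer (an int64 column on disk).
row = {
    'timestamp': EPOCH + timedelta(microseconds=epoch_us),
    'timestamp_ns': epoch_ns,
}
```

Nothing is lost in this layout: `epoch_us * 1000 + leftover_ns` reproduces the original nanosecond value, and the integer field alone is enough to recover full precision.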

Thanks again.

 - db

On Thu, Oct 10, 2019 at 5:11 AM Joris Van den Bossche <
jorisvandenbossche@gmail.com> wrote:

> Hi David,
>
> This is intentional, see
> https://arrow.apache.org/docs/python/parquet.html#storing-timestamps for
> some explanation in the documentation. Basically, the parquet format only
> supports ms and us resolution, and so nanosecond timestamps (which are
> supported by Arrow) are converted to one of those resolutions.
>
> We could maybe clarify that better in the error message (something like
> "only 'ms' and 'us' are supported") ?
>
> In the latest version of the parquet format specification, there is
> actually support for nanosecond resolution as well (
> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype).
> You can obtain this by specifying version="2.0", but the implementation is
> not yet fully ready (see https://issues.apache.org/jira/browse/PARQUET-458),
> and also not all frameworks support this version (so if compatibility
> across processing frameworks is important, it is recommended to stick with
> version 1).
>
> Joris

Re: Question about timestamps ...

Posted by Joris Van den Bossche <jo...@gmail.com>.
Hi David,

This is intentional, see
https://arrow.apache.org/docs/python/parquet.html#storing-timestamps for
some explanation in the documentation. Basically, the Parquet format only
supports ms and us resolution, and so nanosecond timestamps (which are
supported by Arrow) are converted to one of those resolutions.

We could maybe clarify that better in the error message (something like
"only 'ms' and 'us' are supported") ?

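As a rough illustration of the clearer message suggested above, the check could look something like the following plain-Python sketch. The function name `validate_coerce_timestamps` and the exact wording are hypothetical; this is not the code in `_parquet.pxi`.

```python
SUPPORTED_UNITS = ('ms', 'us')


def validate_coerce_timestamps(value):
    """Return value unchanged if it is acceptable, else raise a descriptive error."""
    if value is None or value in SUPPORTED_UNITS:
        return value
    raise ValueError(
        "Invalid value for coerce_timestamps: {0!r}; "
        "only 'ms' and 'us' are supported".format(value)
    )
```

With this shape, passing `'ns'` still raises `ValueError`, but the message names the accepted units instead of only echoing the rejected one.
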
In the latest version of the parquet format specification, there is
actually support for nanosecond resolution as well (
https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#deprecated-timestamp-convertedtype).
You can obtain this by specifying version="2.0", but the implementation is
not yet fully ready (see https://issues.apache.org/jira/browse/PARQUET-458),
and also not all frameworks support this version (so if compatibility
across processing frameworks is important, it is recommended to stick with
version 1).

Joris
