Posted to jira@arrow.apache.org by "Raphael Taylor-Davies (Jira)" <ji...@apache.org> on 2022/04/13 10:21:00 UTC

[jira] [Created] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

Raphael Taylor-Davies created ARROW-16184:
---------------------------------------------

             Summary: [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
                 Key: ARROW-16184
                 URL: https://issues.apache.org/jira/browse/ARROW-16184
             Project: Apache Arrow
          Issue Type: Bug
            Reporter: Raphael Taylor-Davies


As pointed out in https://issues.apache.org/jira/browse/ARROW-2429, the following code results in the schema changing when a table is written to a Parquet file and read back:
#!/usr/bin/env python
import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

# create DataFrame with a datetime column
df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
df['created'] = pd.to_datetime(df['created'])

# create Arrow table from DataFrame
table = pa.Table.from_pandas(df, preserve_index=False)

# write the table as a parquet file, then read it back again
pq.write_table(table, 'foo.parquet')
table2 = pq.read_table('foo.parquet')

print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
 

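That issue was closed as a limitation of the Parquet 1.x format, which cannot represent nanosecond timestamps. That is fine; however, the Arrow schema embedded within the Parquet metadata still lists the column as a nanosecond timestamp array, so the embedded schema disagrees with the microsecond units reported when the file is read back.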

 

This was discovered while investigating a bug report against the arrow-rs Parquet implementation: https://github.com/apache/arrow-rs/issues/1459

 

 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)