Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/04/15 13:34:00 UTC

[jira] [Commented] (ARROW-16184) [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet

    [ https://issues.apache.org/jira/browse/ARROW-16184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17522831#comment-17522831 ] 

Joris Van den Bossche commented on ARROW-16184:
-----------------------------------------------

> however, the Arrow schema embedded within the Parquet metadata still lists the data as being a nanosecond array. This causes issues depending on which schema the reader opts to "trust".

[~tustvold] I think you need to see the stored Arrow schema as "the schema of the original Arrow data" that was used to write the Parquet file. In that sense, the schema _is_ correct (the original Arrow data _did_ have nanosecond resolution). 
So that also means that this stored Arrow schema doesn't necessarily say anything about the data that is actually stored in the Parquet file. This is one example where there is a difference, but there are also other examples (e.g. extension types, duration, fixed-size list, ... are all types that are not directly supported in Parquet, and would thus give a difference between the Parquet schema and the stored Arrow schema).
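
To make that concrete, here is a minimal sketch (assuming the 'foo.parquet' file from the reproduction quoted below, and that pyarrow stores the serialized original Arrow schema base64-encoded under the "ARROW:schema" key of the Parquet key/value metadata) comparing the stored Arrow schema with the schema derived from the Parquet types:

{code:python}
import base64

import pyarrow as pa
import pyarrow.ipc
import pyarrow.parquet as pq

# Parquet key/value metadata, which includes the embedded Arrow schema
kv_meta = pq.read_metadata('foo.parquet').metadata

# Decode the stored (original) Arrow schema from the "ARROW:schema" entry
stored_schema = pa.ipc.read_schema(
    pa.py_buffer(base64.b64decode(kv_meta[b'ARROW:schema'])))
print(stored_schema.field('created').type)  # expected: timestamp[ns, tz=UTC]

# Schema derived from the Parquet column types themselves
print(pq.read_schema('foo.parquet').field('created').type)  # expected: timestamp[us, tz=UTC]
{code}

The first schema describes the original Arrow data, the second what is physically stored in the Parquet columns, which is exactly the difference discussed here.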

> [Python] Incorrect Timestamp Unit in Embedded Arrow Schema Within Parquet
> -------------------------------------------------------------------------
>
>                 Key: ARROW-16184
>                 URL: https://issues.apache.org/jira/browse/ARROW-16184
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>            Reporter: Raphael Taylor-Davies
>            Priority: Minor
>
> As pointed out in https://issues.apache.org/jira/browse/ARROW-2429, the following code results in the schema changing when writing and then reading back a Parquet file.
> {code:python}
> #!/usr/bin/env python
> import pyarrow as pa
> import pyarrow.parquet as pq
> import pandas as pd
> # create DataFrame with a datetime column
> df = pd.DataFrame({'created': ['2018-04-04T10:14:14Z']})
> df['created'] = pd.to_datetime(df['created'])
> # create Arrow table from DataFrame
> table = pa.Table.from_pandas(df, preserve_index=False)
> # write the table as a parquet file, then read it back again
> pq.write_table(table, 'foo.parquet')
> table2 = pq.read_table('foo.parquet')
> print(table.schema[0])  # pyarrow.Field<created: timestamp[ns]> (nanosecond units)
> print(table2.schema[0]) # pyarrow.Field<created: timestamp[us]> (microsecond units)
> {code}
> This was closed as a limitation of the Parquet 1.x format for representing nanosecond timestamps. This is fine; however, the Arrow schema embedded within the Parquet metadata still lists the data as being a nanosecond array. This causes issues depending on which schema the reader opts to "trust".
> This was discovered as part of the investigation into a bug report on the arrow-rs parquet implementation - [https://github.com/apache/arrow-rs/issues/1459]
> Specifically, the metadata written is:
> {code:java}
> Schema {
>     endianness: Little,
>     fields: Some(
>         [
>             Field {
>                 name: Some(
>                     "created",
>                 ),
>                 nullable: true,
>                 type_type: Timestamp,
>                 type_: Timestamp {
>                     unit: NANOSECOND,
>                     timezone: Some(
>                         "UTC",
>                     ),
>                 },
>                 dictionary: None,
>                 children: Some(
>                     [],
>                 ),
>                 custom_metadata: None,
>             },
>         ],
>     ),
>     custom_metadata: Some(
>         [
>             KeyValue {
>                 key: Some(
>                     "pandas",
>                 ),
>                 value: Some(
>                     "{\"index_columns\": [], \"column_indexes\": [], \"columns\": [{\"name\": \"created\", \"field_name\": \"created\", \"pandas_type\": \"datetimetz\", \"numpy_type\": \"datetime64[ns]\", \"metadata\": {\"timezone\": \"UTC\"}}], \"creator\": {\"library\": \"pyarrow\", \"version\": \"6.0.1\"}, \"pandas_version\": \"1.4.0\"}",
>                 ),
>             },
>         ],
>     ),
>     features: None,
> }
> {code}
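
As a related sketch, assuming a pyarrow version whose write_table supports version='2.6' (the Parquet format revision that adds a nanosecond timestamp logical type): writing with that format version keeps the nanosecond unit, so the Parquet schema and the stored Arrow schema agree on the round trip.

{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

df = pd.DataFrame({'created': pd.to_datetime(['2018-04-04T10:14:14Z'])})
table = pa.Table.from_pandas(df, preserve_index=False)

# With Parquet format version 2.6 (assumed available here), nanosecond
# timestamps can be stored directly, so no coercion to microseconds is needed.
pq.write_table(table, 'foo_ns.parquet', version='2.6')

print(pq.read_table('foo_ns.parquet').schema[0])  # expected: timestamp[ns, tz=UTC]
{code}

Alternatively, passing coerce_timestamps='us' to write_table makes the coercion explicit on the Arrow side before writing, if the older format version is required.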



--
This message was sent by Atlassian Jira
(v8.20.1#820001)