Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/08/21 23:44:00 UTC

[jira] [Closed] (ARROW-4967) [C++] Parquet: Object type and stats lost when using 96-bit timestamps

     [ https://issues.apache.org/jira/browse/ARROW-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wes McKinney closed ARROW-4967.
-------------------------------
    Resolution: Won't Fix

Computation of statistics is disabled for INT96. As far as I know, we don't intend to change this. cc [~mdeepak]

> [C++] Parquet: Object type and stats lost when using 96-bit timestamps
> ----------------------------------------------------------------------
>
>                 Key: ARROW-4967
>                 URL: https://issues.apache.org/jira/browse/ARROW-4967
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.12.1
>         Environment: PyArrow: 0.12.1
> Python: 2.7.15, 3.7.2
> Pandas: 0.24.2
>            Reporter: Diego Argueta
>            Priority: Minor
>              Labels: parquet
>
> Run the following code:
> {code:python}
> import datetime as dt
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
> table = pa.Table.from_pandas(dataframe, preserve_index=False)
> pq.write_table(table, 'int64.parq')
> pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
> {code}
> Examining the {{int64.parq}} file, we see that the column metadata includes an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.
> {code}
> file schema: schema 
> --------------------------------------------------------------------------------
> foo:         OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1: RC:1 TS:76 OFFSET:4 
> --------------------------------------------------------------------------------
> foo:          INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 2019-12-31T23:59:59.999000, num_nulls: 0]
> {code}
> However, if we look at {{int96.parq}}, that metadata appears to be lost: there is no object type and there are no column stats.
> {code}
> file schema: schema 
> --------------------------------------------------------------------------------
> foo:         OPTIONAL INT96 R:0 D:1
> row group 1: RC:1 TS:58 OFFSET:4 
> --------------------------------------------------------------------------------
> foo:          INT96 SNAPPY ... ST:[no stats for this column]
> {code}
> This is a bit confusing, since the metadata for the exact same data can look different depending on whether an unrelated flag is set or cleared.


