Posted to issues@arrow.apache.org by "Wes McKinney (Jira)" <ji...@apache.org> on 2019/08/21 23:44:00 UTC
[jira] [Closed] (ARROW-4967) [C++] Parquet: Object type and stats lost when using 96-bit timestamps
[ https://issues.apache.org/jira/browse/ARROW-4967?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Wes McKinney closed ARROW-4967.
-------------------------------
Resolution: Won't Fix
Computation of statistics is disabled for INT96. We don't intend to do anything about this AFAIK cc [~mdeepak]
> [C++] Parquet: Object type and stats lost when using 96-bit timestamps
> ----------------------------------------------------------------------
>
> Key: ARROW-4967
> URL: https://issues.apache.org/jira/browse/ARROW-4967
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.12.1
> Environment: PyArrow: 0.12.1
> Python: 2.7.15, 3.7.2
> Pandas: 0.24.2
> Reporter: Diego Argueta
> Priority: Minor
> Labels: parquet
>
> Run the following code:
> {code:python}
> import datetime as dt
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> dataframe = pd.DataFrame({'foo': [dt.datetime.now()]})
> table = pa.Table.from_pandas(dataframe, preserve_index=False)
> pq.write_table(table, 'int64.parq')
> pq.write_table(table, 'int96.parq', use_deprecated_int96_timestamps=True)
> {code}
> Examining the {{int64.parq}} file, we see that the column metadata includes an object type of {{TIMESTAMP_MICROS}} and also gives some stats. All is well.
> {code}
> file schema: schema
> --------------------------------------------------------------------------------
> foo: OPTIONAL INT64 O:TIMESTAMP_MICROS R:0 D:1
> row group 1: RC:1 TS:76 OFFSET:4
> --------------------------------------------------------------------------------
> foo: INT64 SNAPPY ... ST:[min: 2019-12-31T23:59:59.999000, max: 2019-12-31T23:59:59.999000, num_nulls: 0]
> {code}
> However, if we look at {{int96.parq}}, that metadata appears to be lost: there is no object type and there are no column stats.
> {code}
> file schema: schema
> --------------------------------------------------------------------------------
> foo: OPTIONAL INT96 R:0 D:1
> row group 1: RC:1 TS:58 OFFSET:4
> --------------------------------------------------------------------------------
> foo: INT96 SNAPPY ... ST:[no stats for this column]
> {code}
> This is a bit confusing, since the metadata for the exact same data can look different depending on an unrelated flag being set or cleared.