Posted to issues@arrow.apache.org by "Max Firman (Jira)" <ji...@apache.org> on 2019/12/09 11:04:00 UTC
[jira] [Created] (ARROW-7350) [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types
Max Firman created ARROW-7350:
---------------------------------
Summary: [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types
Key: ARROW-7350
URL: https://issues.apache.org/jira/browse/ARROW-7350
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1
Reporter: Max Firman
Parquet file metadata for Decimal type columns contains min and max statistics that are left as raw bytes instead of being decoded into Decimal values. This causes issues in dependent libraries such as Dask (see [https://github.com/dask/dask/issues/5647]).
{code:python|title=Reproducible example|borderStyle=solid}
from decimal import Decimal
import random
import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa
NUM_DATA_POINTS_PER_PARTITION = 25
random.seed(0)
# Build a single Decimal column from random integer and fractional parts.
data1 = [
    {"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")}
    for _ in range(NUM_DATA_POINTS_PER_PARTITION)
]
df = pd.DataFrame(data1)
table = pa.Table.from_pandas(df)
# Write the table to Parquet, then read the column statistics back from the file metadata.
pq.write_table(table, 'my_data.parquet')
parquet_file = pq.ParquetFile('my_data.parquet')
statistics = parquet_file.metadata.row_group(0).column(0).statistics
assert isinstance(statistics.min, Decimal) # <-- AssertionError here because min has type bytes rather than Decimal
assert isinstance(statistics.max, Decimal)
{code}
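For illustration, here is a minimal sketch of the decoding that the statistics appear to be missing. It assumes the raw bytes follow the Parquet format's DECIMAL encoding for byte-array columns (a big-endian two's-complement unscaled integer) and that the column's scale is known from its type. The helper name decode_parquet_decimal is hypothetical and is not part of the pyarrow API.

```python
from decimal import Decimal


def decode_parquet_decimal(raw: bytes, scale: int) -> Decimal:
    """Hypothetical helper: turn raw statistics bytes into a Decimal.

    Assumes `raw` is a big-endian two's-complement unscaled integer and
    `scale` is the number of digits after the decimal point.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and does not
    # depend on the precision of the current decimal context.
    return Decimal((sign, digits, -scale))


# Example with made-up bytes: unscaled value 12345 at scale 2.
example = (12345).to_bytes(16, byteorder="big", signed=True)
print(decode_parquet_decimal(example, 2))
```

A consumer such as Dask would still need to obtain the scale from the column's declared decimal type, which this sketch does not attempt.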