Posted to issues@arrow.apache.org by "Max Firman (Jira)" <ji...@apache.org> on 2019/12/09 11:04:00 UTC

[jira] [Created] (ARROW-7350) [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types

Max Firman created ARROW-7350:
---------------------------------

             Summary: [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types
                 Key: ARROW-7350
                 URL: https://issues.apache.org/jira/browse/ARROW-7350
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1
            Reporter: Max Firman


Parquet file metadata for Decimal type columns contains min and max statistics that are returned as raw bytes instead of being decoded into Decimal values. This causes issues in dependent libraries such as Dask (see [https://github.com/dask/dask/issues/5647]).

 
{code:python|title=Reproducible example|borderStyle=solid}
from decimal import Decimal
import random

import pandas as pd
import pyarrow.parquet as pq
import pyarrow as pa

NUM_DATA_POINTS_PER_PARTITION = 25

random.seed(0)
# Build one partition of random Decimal values
data1 = [
    {"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")}
    for _ in range(NUM_DATA_POINTS_PER_PARTITION)
]

df = pd.DataFrame(data1)
table = pa.Table.from_pandas(df)
pq.write_table(table, 'my_data.parquet')

parquet_file = pq.ParquetFile('my_data.parquet')

assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.min, Decimal) # <-- AssertionError here because min has type bytes rather than Decimal
assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.max, Decimal)

{code}
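Until the statistics are decoded by the library, a possible workaround is to decode the bytes by hand. The sketch below assumes the returned bytes are the raw Parquet encoding of the decimal, which the Parquet format defines as the unscaled integer in big-endian two's complement, and that the caller already knows the column's scale (for example from the column's decimal type). The helper name decode_parquet_decimal is hypothetical and not part of pyarrow.

```python
from decimal import Decimal


def decode_parquet_decimal(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet-encoded decimal (hypothetical helper, not a pyarrow API).

    Assumes `raw` holds the unscaled value as a big-endian two's complement
    integer and that `scale` is the number of digits after the decimal point.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building from a (sign, digits, exponent) tuple is exact and does not
    # depend on the precision of the current decimal context.
    return Decimal((sign, digits, -scale))
```

For example, decode_parquet_decimal(b"\x30\x39", 2) gives Decimal("123.45"), since 0x3039 is 12345. Whether the statistics bytes in a given pyarrow version match this encoding exactly should be checked against the file before relying on it.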
 
--
This message was sent by Atlassian Jira
(v8.3.4#803005)