Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2022/04/05 13:20:00 UTC

[jira] [Updated] (ARROW-7350) [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types

     [ https://issues.apache.org/jira/browse/ARROW-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-7350:
-----------------------------------------
    Fix Version/s: 8.0.0

> [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7350
>                 URL: https://issues.apache.org/jira/browse/ARROW-7350
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Max Firman
>            Priority: Major
>             Fix For: 8.0.0
>
>
> For Decimal type columns, the min and max statistics in the Parquet file metadata are returned as raw bytes instead of being decoded into Decimal values. This causes issues in dependent libraries like Dask (see [https://github.com/dask/dask/issues/5647]).
>  
> {code:python|title=Reproducible example|borderStyle=solid}
> from decimal import Decimal
> import random
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow as pa
> NUM_DATA_POINTS_PER_PARTITION = 25
> random.seed(0)
> data1 = [{"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")} for i in range(NUM_DATA_POINTS_PER_PARTITION)]
> df = pd.DataFrame(data1)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'my_data.parquet')
> parquet_file = pq.ParquetFile('my_data.parquet')
> assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.min, Decimal) # <-- AssertionError here because min has type bytes rather than Decimal
> assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.max, Decimal)
> {code}
>  
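Until the statistics are decoded by the library, a caller could convert the raw bytes by hand. The sketch below is a possible workaround, not the fix adopted for this issue. It assumes the column is stored with a byte-array physical type, in which case the Parquet format encodes a decimal as a big-endian two's complement unscaled integer. The helper name `decode_decimal_stat` is hypothetical, and the `scale` argument is assumed to come from the column's Arrow type (for example the `scale` attribute of the decimal type in the table schema).

```python
from decimal import Decimal


def decode_decimal_stat(raw: bytes, scale: int) -> Decimal:
    """Hypothetical helper: decode a raw Parquet decimal statistic.

    Assumes `raw` holds the unscaled value as a big-endian two's
    complement integer and that `scale` is the column's decimal scale.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(d) for d in str(abs(unscaled)))
    # Building the Decimal from a (sign, digits, exponent) tuple is exact
    # and does not depend on the current decimal context precision.
    return Decimal((sign, digits, -scale))


# Example with made-up bytes: unscaled value 12345 and scale 2.
example = (12345).to_bytes(4, byteorder="big", signed=True)
print(decode_decimal_stat(example, 2))  # 123.45
```

With the reproducible example above, such a helper would be applied to `statistics.min` and `statistics.max` only when they are of type `bytes`, so that it becomes a no-op once the values are returned as Decimals.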



--
This message was sent by Atlassian Jira
(v8.20.1#820001)