
Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2019/12/10 14:27:00 UTC

[jira] [Commented] (ARROW-7350) [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types

    [ https://issues.apache.org/jira/browse/ARROW-7350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16992597#comment-16992597 ] 

Joris Van den Bossche commented on ARROW-7350:
----------------------------------------------

[~max.firman] Thanks for the report!

Such a conversion would fit in the {{_box_logical_type_value}} function (https://github.com/apache/arrow/blob/master/python/pyarrow/_parquet.pyx#L250-L294), which already handles converting raw values to Python types, e.g. for timestamps.

I would only need to check whether we already have a conversion utility from bytes to Decimal.
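
For illustration, a minimal pure-Python sketch of what such a bytes-to-Decimal conversion could look like. It assumes the statistics bytes hold the unscaled value as a big-endian two's-complement integer (the layout the Parquet format uses for DECIMAL values stored in byte arrays) and that the scale is taken from the column's schema. The helper name {{decimal_from_parquet_bytes}} is hypothetical and is not an existing pyarrow function.

```python
from decimal import Decimal


def decimal_from_parquet_bytes(raw: bytes, scale: int) -> Decimal:
    """Decode a Parquet DECIMAL byte string into a Python Decimal.

    Hypothetical helper, not part of pyarrow. Assumes `raw` is the
    big-endian two's-complement unscaled integer and `scale` is the
    number of digits after the decimal point from the column schema.
    """
    unscaled = int.from_bytes(raw, byteorder="big", signed=True)
    sign = 1 if unscaled < 0 else 0
    digits = tuple(int(ch) for ch in str(abs(unscaled)))
    # Build the Decimal from (sign, digits, exponent) so the result is
    # exact and does not depend on the current decimal context precision.
    return Decimal((sign, digits, -scale))


# Usage: 12345 stored in 16 bytes with scale 2 represents 123.45
raw_min = (12345).to_bytes(16, byteorder="big", signed=True)
print(decimal_from_parquet_bytes(raw_min, scale=2))
```

Constructing the Decimal from a tuple rather than dividing by a power of ten avoids rounding for values with more digits than the default context precision.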

> [Python] Parquet file metadata min and max statistics not decoded from bytes for Decimal data types
> ---------------------------------------------------------------------------------------------------
>
>                 Key: ARROW-7350
>                 URL: https://issues.apache.org/jira/browse/ARROW-7350
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.15.1
>            Reporter: Max Firman
>            Priority: Major
>
> Parquet file metadata for Decimal type columns contains min and max values that are not decoded from bytes into Decimals. This causes issues in dependent libraries like Dask (see [https://github.com/dask/dask/issues/5647]).
>  
> {code:python|title=Reproducible example|borderStyle=solid}
> from decimal import Decimal
> import random
> import pandas as pd
> import pyarrow.parquet as pq
> import pyarrow as pa
> NUM_DATA_POINTS_PER_PARTITION = 25
> random.seed(0)
> data1 = [{"col1": Decimal(f"{random.randint(0, 999)}.{random.randint(0, 99)}")} for i in range(NUM_DATA_POINTS_PER_PARTITION)]
> df = pd.DataFrame(data1)
> table = pa.Table.from_pandas(df)
> pq.write_table(table, 'my_data.parquet')
> parquet_file = pq.ParquetFile('my_data.parquet')
> assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.min, Decimal) # <-- AssertionError here because min has type bytes rather than Decimal
> assert isinstance(parquet_file.metadata.row_group(0).column(0).statistics.max, Decimal)
> {code}


