Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/02/04 09:14:00 UTC

[jira] [Updated] (ARROW-7732) [C++] Parquet statistics wrong for dictionary type

     [ https://issues.apache.org/jira/browse/ARROW-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joris Van den Bossche updated ARROW-7732:
-----------------------------------------
    Summary: [C++] Parquet statistics wrong for dictionary type  (was: [Python][C++] Parquet statistics wrong for pandas Categorical)

> [C++] Parquet statistics wrong for dictionary type
> --------------------------------------------------
>
>                 Key: ARROW-7732
>                 URL: https://issues.apache.org/jira/browse/ARROW-7732
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Python
>    Affects Versions: 0.16.0, 0.15.1
>            Reporter: Florian Jetter
>            Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are identical across all row groups: they are computed over the entire {{CategoricalDtype}} (all categories) instead of over the data contained in each row group.
> h3. Expected behaviour
> The row group statistics should only cover data that is part of the actual row group, not the entire {{CategoricalDtype}}.
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
>     table,
>     "test_parquet",
>     chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> <pyarrow._parquet.Statistics object at 0x1163b5280>
>   has_min_max: True
>   min: 1
>   max: 42
>   null_count: 0
>   distinct_count: 0
>   num_values: 1
>   physical_type: BYTE_ARRAY
>   logical_type: String
>   converted_type (legacy): UTF8
> {code}
> Expected for the first row group: {{min: 1}} and {{max: 1}}, instead of {{min: 1}} and {{max: 42}}.
>  
> Tested with 
>  pandas==1.0.0
>  pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / essentially 0.16.0)
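
The difference between the observed and the expected statistics in the minimal example above can be illustrated with a small pure-Python model of a dictionary-encoded column. This is only a sketch: the names {{row_groups}}, {{stats_from_dictionary}} and {{stats_from_values}} are hypothetical and do not correspond to Arrow or Parquet code. It assumes the reported behaviour amounts to taking min/max over all categories rather than over the values referenced in each row group.

```python
# Simplified model of a dictionary-encoded column; NOT Arrow's actual code.
dictionary = ["1", "42"]   # the categories of the CategoricalDtype
indices = [0, 1]           # one index per row, pointing into `dictionary`
chunk_size = 1             # rows per row group, as in the minimal example


def row_groups(indices, chunk_size):
    """Split the index array into consecutive row groups."""
    return [indices[i:i + chunk_size] for i in range(0, len(indices), chunk_size)]


def stats_from_dictionary(dictionary, group):
    """Observed behaviour: min/max over the whole dictionary, ignoring `group`."""
    return min(dictionary), max(dictionary)


def stats_from_values(dictionary, group):
    """Expected behaviour: min/max over the values actually in the row group."""
    values = [dictionary[i] for i in group]
    return min(values), max(values)


groups = row_groups(indices, chunk_size)
observed = [stats_from_dictionary(dictionary, g) for g in groups]
expected = [stats_from_values(dictionary, g) for g in groups]

for n, (obs, exp) in enumerate(zip(observed, expected)):
    print(f"row group {n}: observed (min, max)={obs}, expected (min, max)={exp}")
```

In this model the first row group is reported as ("1", "42") when statistics are taken from the dictionary, while decoding only the rows in that group gives ("1", "1"), which matches the expectation stated in the report.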


