Posted to issues@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2020/02/04 09:14:00 UTC
[jira] [Updated] (ARROW-7732) [C++] Parquet statistics wrong for dictionary type
[ https://issues.apache.org/jira/browse/ARROW-7732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Joris Van den Bossche updated ARROW-7732:
-----------------------------------------
Summary: [C++] Parquet statistics wrong for dictionary type (was: [Python][C++] Parquet statistics wrong for pandas Categorical)
> [C++] Parquet statistics wrong for dictionary type
> --------------------------------------------------
>
> Key: ARROW-7732
> URL: https://issues.apache.org/jira/browse/ARROW-7732
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 0.16.0, 0.15.1
> Reporter: Florian Jetter
> Priority: Major
>
> h3. Observed behaviour
> Statistics for categorical data are identical across all row groups and describe the entire {{CategoricalDtype}} (all categories) rather than only the data contained in each row group.
> h3. Expected behaviour
> The statistics of a row group should cover only the data that is actually part of that row group, not the entire {{CategoricalDtype}}.
> h3. Minimal example
> {code:python}
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
> table = pa.Table.from_pandas(test_df)
> pq.write_table(
>     table,
>     "test_parquet",
>     chunk_size=1,
> )
> test_parquet = pq.ParquetFile("test_parquet")
> test_parquet.metadata.row_group(0).column(0).statistics
> {code}
> {code:java}
> Out[1]:
> <pyarrow._parquet.Statistics object at 0x1163b5280>
> has_min_max: True
> min: 1
> max: 42
> null_count: 0
> distinct_count: 0
> num_values: 1
> physical_type: BYTE_ARRAY
> logical_type: String
> converted_type (legacy): UTF8
> {code}
> Expected for the first row group: {{min: 1}} and {{max: 1}}, instead of {{min: 1}} and {{max: 42}}.
>
> Tested with
> pandas==1.0.0
> pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / essentially 0.16.0)
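
To make the expected behaviour concrete, here is a minimal pure-Python sketch of how per-row-group min/max could be derived for dictionary-encoded data. It does not use pyarrow and is not the Arrow implementation; the function name {{expected_row_group_stats}} and its parameters are hypothetical, chosen only for illustration. The idea is to look up only the dictionary entries referenced by the indices in each row group, rather than taking min/max over the whole dictionary.

```python
from typing import List, Optional, Tuple


def expected_row_group_stats(
    dictionary: List[str],
    indices: List[Optional[int]],
    chunk_size: int,
) -> List[Tuple[Optional[str], Optional[str]]]:
    """Return (min, max) per row group for dictionary-encoded values.

    Hypothetical helper: `dictionary` holds the categories, `indices`
    holds one dictionary index per row (None for a null), and
    `chunk_size` is the number of rows per row group.
    """
    stats = []
    for start in range(0, len(indices), chunk_size):
        # Decode only the values that actually occur in this row group.
        values = [
            dictionary[i]
            for i in indices[start:start + chunk_size]
            if i is not None
        ]
        if values:
            stats.append((min(values), max(values)))
        else:
            # A row group containing only nulls has no min/max.
            stats.append((None, None))
    return stats


# Same data as the minimal example: categories ["1", "42"], one row each.
print(expected_row_group_stats(["1", "42"], [0, 1], chunk_size=1))
# prints [('1', '1'), ('42', '42')]
```

With the data from the report, the first row group yields min "1" and max "1", which matches the expected statistics stated above. Values are compared as strings here, in line with the BYTE_ARRAY / String type shown in the output.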
--
This message was sent by Atlassian Jira
(v8.3.4#803005)