Posted to dev@arrow.apache.org by "Florian Jetter (Jira)" <ji...@apache.org> on 2020/01/31 10:23:00 UTC
[jira] [Created] (ARROW-7732) [Python][C++] Parquet statistics wrong for pandas Categorical
Florian Jetter created ARROW-7732:
-------------------------------------
Summary: [Python][C++] Parquet statistics wrong for pandas Categorical
Key: ARROW-7732
URL: https://issues.apache.org/jira/browse/ARROW-7732
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 0.15.1, 0.16.0
Reporter: Florian Jetter
h3. Observed behaviour
Statistics for categorical data are identical for all row groups: they are computed over the entire {{CategoricalDtype}} instead of over the data contained in each row group.
h3. Expected behaviour
The statistics of a row group should only cover the data that is actually part of that row group, not the entire {{CategoricalDtype}}.
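To make the difference concrete, here is a plain-Python sketch of the two behaviours. It does not use pyarrow and is not meant to reflect how the Parquet writer is implemented; the helper names {{stats_from_dictionary}} and {{stats_from_values}} are hypothetical and exist only for this illustration. The categorical column is modelled as a list of categories plus one list of integer codes per row group.

```python
# Illustration only: a categorical column modelled as a list of categories
# plus integer codes. No pyarrow involved; helper names are hypothetical.
categories = ["1", "42"]   # the full CategoricalDtype
row_groups = [[0], [1]]    # codes stored in each row group (one row each)


def stats_from_dictionary(categories):
    """Observed behaviour: min/max taken over every category."""
    return min(categories), max(categories)


def stats_from_values(categories, codes):
    """Expected behaviour: min/max over the values the codes refer to."""
    values = [categories[code] for code in codes]
    return min(values), max(values)


observed = [stats_from_dictionary(categories) for _ in row_groups]
expected = [stats_from_values(categories, codes) for codes in row_groups]

print("observed:", observed)
print("expected:", expected)
```

With the data from this report, the first variant yields the same {{("1", "42")}} pair for both row groups, while the second yields {{("1", "1")}} and {{("42", "42")}}.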
h3. Minimal example
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
table = pa.Table.from_pandas(test_df)
pq.write_table(
    table,
    "test_parquet",
    chunk_size=1,  # one row per row group
)
test_parquet = pq.ParquetFile("test_parquet")
test_parquet.metadata.row_group(0).column(0).statistics
{code}
{code:java}
Out[1]:
<pyarrow._parquet.Statistics object at 0x1163b5280>
has_min_max: True
min: 1
max: 42
null_count: 0
distinct_count: 0
num_values: 1
physical_type: BYTE_ARRAY
logical_type: String
converted_type (legacy): UTF8
{code}
Expected for the first row group: {{min: 1}} and {{max: 1}}, instead of {{max: 42}}.
Tested with
pandas==1.0.0
pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / essentially 0.16.0)