Posted to dev@arrow.apache.org by "Florian Jetter (Jira)" <ji...@apache.org> on 2020/01/31 10:23:00 UTC

[jira] [Created] (ARROW-7732) [Python][C++] Parquet statistics wrong for pandas Categorical

Florian Jetter created ARROW-7732:
-------------------------------------

             Summary: [Python][C++] Parquet statistics wrong for pandas Categorical
                 Key: ARROW-7732
                 URL: https://issues.apache.org/jira/browse/ARROW-7732
             Project: Apache Arrow
          Issue Type: Bug
          Components: Python
    Affects Versions: 0.15.1, 0.16.0
            Reporter: Florian Jetter


h3. Observed behaviour

Statistics for categorical data are identical across all row groups and describe the entire {{CategoricalDtype}} rather than the data contained in each row group.
h3. Expected behaviour

The row group statistics should cover only the data that is actually part of the row group, not the entire {{CategoricalDtype}}.
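
The difference between the two behaviours can be modelled without Parquet at all. The following is a pure-Python sketch, not pyarrow's implementation: a categorical column is represented as a list of categories plus integer codes, and the helper names ({{split_row_groups}}, {{stats_from_dictionary}}, {{stats_from_values}}) are hypothetical and exist only for illustration.

```python
from typing import List, Sequence, Tuple


def split_row_groups(codes: Sequence[int], chunk_size: int) -> List[List[int]]:
    """Split the integer codes into consecutive row groups of at most chunk_size rows."""
    return [list(codes[i:i + chunk_size]) for i in range(0, len(codes), chunk_size)]


def stats_from_dictionary(categories: Sequence[str], codes: Sequence[int]) -> Tuple[str, str]:
    """Observed behaviour: min/max over every category, ignoring which codes are present."""
    return min(categories), max(categories)


def stats_from_values(categories: Sequence[str], codes: Sequence[int]) -> Tuple[str, str]:
    """Expected behaviour: min/max over only the values referenced in this row group."""
    present = [categories[code] for code in codes]
    return min(present), max(present)


# Same data as in the report: two categories, one row per row group.
categories = ["1", "42"]
codes = [0, 1]

for index, group in enumerate(split_row_groups(codes, chunk_size=1)):
    print(index, stats_from_dictionary(categories, group), stats_from_values(categories, group))
# 0 ('1', '42') ('1', '1')
# 1 ('1', '42') ('42', '42')
```

The first tuple in each printed row mirrors the statistics reported below for every row group; the second is what a per-row-group computation would yield.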
h3. Minimal example
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
test_df = pd.DataFrame({"categorical": pd.Categorical(["1", "42"])})
table = pa.Table.from_pandas(test_df)
pq.write_table(
    table,
    "test_parquet",
    chunk_size=1,
)
test_parquet = pq.ParquetFile("test_parquet")
test_parquet.metadata.row_group(0).column(0).statistics
{code}
{code:java}
Out[1]:
<pyarrow._parquet.Statistics object at 0x1163b5280>
  has_min_max: True
  min: 1
  max: 42
  null_count: 0
  distinct_count: 0
  num_values: 1
  physical_type: BYTE_ARRAY
  logical_type: String
  converted_type (legacy): UTF8
{code}
Expected for the first row group: {{min: 1}} and {{max: 1}}, instead of {{max: 42}}.

 

Tested with 
 pandas==1.0.0
 pyarrow==bd08d0ecbe355b9e0de7d07e8b9ff6ccdb150e73 (current master / essentially 0.16.0)


