Posted to issues@arrow.apache.org by "Gavin (Jira)" <ji...@apache.org> on 2021/11/18 13:47:00 UTC
[jira] [Created] (ARROW-14767) Categorical int8 index types written as int32 in parquet files
Gavin created ARROW-14767:
-----------------------------
Summary: Categorical int8 index types written as int32 in parquet files
Key: ARROW-14767
URL: https://issues.apache.org/jira/browse/ARROW-14767
Project: Apache Arrow
Issue Type: Bug
Components: Python
Affects Versions: 5.0.0
Environment: NAME="CentOS Linux"
VERSION="7 (Core)"
Reporter: Gavin
When converting from a pandas dataframe to a table, categorical variables are by default given an index type of int8 in the schema (presumably because there are fewer than 128 categories). When this table is written to a parquet file, the schema changes such that the index type becomes int32. This causes an inconsistency between the schemas of tables derived from pandas and those read back from disk.
A minimal recreation of the issue is as follows:
{code:python}
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.fs
import pyarrow.parquet as pq

df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
dtypes = {
    "A": np.dtype("int8"),
    "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
}
df = df.astype(dtypes)
tbl = pa.Table.from_pandas(df)

where = "tmp.parquet"
filesystem = pa.fs.LocalFileSystem()
pq.write_table(
    tbl,
    filesystem.open_output_stream(where, compression=None),
    version="2.0",
)

schema = tbl.schema
read_schema = pq.ParquetFile(filesystem.open_input_file(where)).schema_arrow{code}
By printing schema and read_schema, you can see the inconsistency: column B has index type int8 in the former and int32 in the latter.
I have workarounds in place for this, but am raising the issue anyway so that you can resolve it properly.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)