Posted to jira@arrow.apache.org by "Joris Van den Bossche (Jira)" <ji...@apache.org> on 2021/11/23 12:49:00 UTC
[jira] [Commented] (ARROW-14767) Categorical int8 index types written as int32 in parquet files
[ https://issues.apache.org/jira/browse/ARROW-14767?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447989#comment-17447989 ]
Joris Van den Bossche commented on ARROW-14767:
-----------------------------------------------
Thanks for the report!
I can reproduce your issue. Looking at the written Parquet file's schema, it actually uses int8:
{code}
In [29]: parquet_metadata = pq.ParquetFile(
...: filesystem.open_input_file(where),
...: )
In [30]: parquet_metadata.schema
Out[30]:
<pyarrow._parquet.ParquetSchema object at 0x7f60be97fe00>
required group field_id=-1 schema {
optional int32 field_id=-1 A (Int(bitWidth=8, isSigned=true));
optional binary field_id=-1 B (String);
}
{code}
So that means that this happens on the read side.
I suppose that reading dictionary-encoded values into an Arrow dictionary type doesn't (yet?) support preserving the original integer index bitwidth, and instead falls back to the default int32 index type that Arrow uses.
> Categorical int8 index types written as int32 in parquet files
> --------------------------------------------------------------
>
> Key: ARROW-14767
> URL: https://issues.apache.org/jira/browse/ARROW-14767
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python
> Affects Versions: 5.0.0
> Environment: NAME="CentOS Linux"
> VERSION="7 (Core)"
> Reporter: Gavin
> Priority: Minor
>
> When converting from a pandas dataframe to a table, categorical variables are by default given an index type int8 (presumably because there are fewer than 128 categories) in the schema. When this is written to a parquet file, the schema changes such that the index type is int32 instead. This causes an inconsistency between the schemas of tables derived from pandas and those read from disk.
> A minimal recreation of the issue is as follows:
> {code:python}
> import numpy as np
> import pandas as pd
> import pyarrow as pa
> import pyarrow.parquet as pq
> df = pd.DataFrame({"A": [1, 2, 3, 4, 5], "B": ["a", "a", "b", "c", "b"]})
> dtypes = {
>     "A": np.dtype("int8"),
>     "B": pd.CategoricalDtype(categories=["a", "b", "c"], ordered=None),
> }
> df = df.astype(dtypes)
> tbl = pa.Table.from_pandas(df)
> where = "tmp.parquet"
> filesystem = pa.fs.LocalFileSystem()
> pq.write_table(
>     tbl,
>     filesystem.open_output_stream(where, compression=None),
>     version="2.0",
> )
> schema = tbl.schema
> read_schema = pq.ParquetFile(filesystem.open_input_file(where)).schema_arrow
> {code}
> By printing schema and read_schema, you can see the inconsistency.
> I have workarounds in place for this, but am raising the issue anyway so that you can resolve it properly.
--
This message was sent by Atlassian Jira
(v8.20.1#820001)