Posted to issues@arrow.apache.org by "Yishai Beeri (Jira)" <ji...@apache.org> on 2022/09/06 06:01:00 UTC
[jira] [Created] (ARROW-17625) Cast error on roundtrip of categorical column to parquet and back
Yishai Beeri created ARROW-17625:
------------------------------------
Summary: Cast error on roundtrip of categorical column to parquet and back
Key: ARROW-17625
URL: https://issues.apache.org/jira/browse/ARROW-17625
Project: Apache Arrow
Issue Type: Bug
Components: Parquet, Python
Affects Versions: 9.0.0
Reporter: Yishai Beeri
Writing a table to parquet and then reading it back fails if both of the following hold:
# One of the columns is a dictionary (it came from a pandas Categorical), *and*
# The table's schema is passed to `read_table`.
The read fails while attempting to cast int64 to dictionary (full stack trace below).
This seems related to ARROW-11157, but even if the categorical type is lost when reading from parquet, the reader should not fail when it is given the schema.
Minimal example of failing code:
```
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
a = [1,2,3,4,1,2,3,4,1,2,3,4]
b = ["a" for i in a]
c = [i for i in range(len(a))]
df = pd.DataFrame({"a": a, "b": b, "c": c})
df['a'] = df['a'].astype('category')
print("df dtypes:\n", df.dtypes)
t = pa.Table.from_pandas(df, preserve_index=True)
s = t.schema
ds.write_dataset(t, format='parquet', base_dir='./test')
df2 = pq.read_table('./test', schema=s, use_pandas_metadata=True).to_pandas()
print("df2 dtypes:\n", df2.dtypes)
```
Which gives:
```
df dtypes:
a category
b object
c int64
dtype: object
Traceback (most recent call last):
File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
df2 = pq.read_table('./test', schema=s, use_pandas_metadata=True).to_pandas()
File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary
```