Posted to issues@arrow.apache.org by "Yishai Beeri (Jira)" <ji...@apache.org> on 2022/09/06 06:01:00 UTC

[jira] [Created] (ARROW-17625) Cast error on roundtrip of categorical column to parquet and back

Yishai Beeri created ARROW-17625:
------------------------------------

             Summary: Cast error on roundtrip of categorical column to parquet and back
                 Key: ARROW-17625
                 URL: https://issues.apache.org/jira/browse/ARROW-17625
             Project: Apache Arrow
          Issue Type: Bug
          Components: Parquet, Python
    Affects Versions: 9.0.0
            Reporter: Yishai Beeri


Writing a table to Parquet and then reading it back fails if both of the following hold:
 # One of the columns is a dictionary column (for example, one that came from a pandas Categorical), *and*
 # The table's schema is passed to `read_table`

The read fails on an attempt to cast int64 to dictionary (full stack trace below).

This seems related to ARROW-11157. Even if the categorical type is lost when reading from Parquet, the reader should not fail when reading with the schema.

Minimal example of failing code:

```

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

a = [1,2,3,4,1,2,3,4,1,2,3,4]
b = ["a" for i in a]
c = [i for i in range(len(a))]

df = pd.DataFrame({"a": a, "b": b, "c": c})
df['a'] = df['a'].astype('category')

print("df dtypes:\n", df.dtypes)

t = pa.Table.from_pandas(df, preserve_index=True)
s = t.schema

ds.write_dataset(t, format='parquet', base_dir='./test')

df2 = pq.read_table('./test', schema=s, use_pandas_metadata=True).to_pandas()

print("df2 dtypes:\n", df2.dtypes)

```

Which gives: 

```

df dtypes:
 a    category
b      object
c       int64
dtype: object
Traceback (most recent call last):
  File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
    df2 = pq.read_table('./test', schema=s, use_pandas_metadata=True).to_pandas()
  File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary

```
