Posted to jira@arrow.apache.org by "Yishai Beeri (Jira)" <ji...@apache.org> on 2022/09/06 06:04:00 UTC
[jira] [Updated] (ARROW-17625) Cast error on roundtrip of categorical column to parquet and back
[ https://issues.apache.org/jira/browse/ARROW-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yishai Beeri updated ARROW-17625:
---------------------------------
Description:
Writing a table to Parquet and then reading it back fails if:
# One of the columns is a dictionary column (created from a pandas Categorical), *and*
# The table's schema is passed to `read_table`
The read fails while attempting to cast int64 to dictionary (full stack trace below).
This seems related to ARROW-11157, but even if the categorical type is lost when reading from Parquet, the reader should not raise an error when given the schema.
Minimal example of failing code:
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
a = [1,2,3,4,1,2,3,4,1,2,3,4]
b = ["a" for i in a]
c = [i for i in range(len(a))]
df = pd.DataFrame({"a":a, "b":b, "c":c})
df['a'] = df['a'].astype('category')
print("df dtypes:\n", df.dtypes)
t = pa.Table.from_pandas(df, preserve_index=True)
s = t.schema
ds.write_dataset(t, format='parquet', base_dir='./test')
df2 = pq.read_table('./test', schema=s).to_pandas()
print("df2 dtypes:\n", df2.dtypes)
{code}
Which gives:
{code:none}
df dtypes:
a category
b object
c int64
dtype: object
Traceback (most recent call last):
File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
df2 = pq.read_table('./test', schema=s).to_pandas()
File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary
{code}
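For context, a dictionary (categorical) column stores a list of unique values plus integer indices into that list, so turning a plain int64 column into one means building the dictionary rather than reinterpreting the values. The following is a minimal pure-Python sketch of that idea; `dictionary_encode` here is a hypothetical helper written for illustration, not pyarrow's implementation:

```python
def dictionary_encode(values):
    """Split values into (dictionary, indices), keeping first-seen order.

    Hypothetical helper for illustration only; not part of pyarrow.
    """
    dictionary = []   # unique values, in order of first appearance
    positions = {}    # value -> its index in `dictionary`
    indices = []      # one index per input value
    for v in values:
        if v not in positions:
            positions[v] = len(dictionary)
            dictionary.append(v)
        indices.append(positions[v])
    return dictionary, indices


# Same values as column "a" in the example above.
a = [1, 2, 3, 4, 1, 2, 3, 4, 1, 2, 3, 4]
dictionary, indices = dictionary_encode(a)

# Decoding by index lookup recovers the original values.
decoded = [dictionary[i] for i in indices]
```

In pyarrow, `Array.dictionary_encode()` performs this kind of encoding, so reading without `schema=` and re-encoding column `a` afterwards might serve as a workaround. That is an assumption and has not been verified against 9.0.0.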
was:
Writing a table to Parquet and then reading it back fails if:
# One of the columns is a dictionary column (created from a pandas Categorical), *and*
# The table's schema is passed to `read_table`
The read fails while attempting to cast int64 to dictionary (full stack trace below).
This seems related to ARROW-11157, but even if the categorical type is lost when reading from Parquet, the reader should not raise an error when given the schema.
Minimal example of failing code:
```
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
a = [1,2,3,4,1,2,3,4,1,2,3,4]
b = ["a" for i in a]
c = [i for i in range(len(a))]
df = pd.DataFrame({"a":a, "b":b, "c":c})
df['a'] = df['a'].astype('category')
print("df dtypes:\n", df.dtypes)
t = pa.Table.from_pandas(df, preserve_index=True)
s = t.schema
ds.write_dataset(t, format='parquet', base_dir='./test')
df2 = pq.read_table('./test', schema=s, use_pandas_metadata=True).to_pandas()
print("df2 dtypes:\n", df2.dtypes)
```
Which gives:
```
df dtypes:
a category
b object
c int64
dtype: object
Traceback (most recent call last):
File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
df2 = pq.read_table('./test', schema=s, use_pandas_metadata=True).to_pandas()
File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
return dataset.read(columns=columns, use_threads=use_threads,
File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
table = self._dataset.to_table(
File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary
```
> Cast error on roundtrip of categorical column to parquet and back
> -----------------------------------------------------------------
>
> Key: ARROW-17625
> URL: https://issues.apache.org/jira/browse/ARROW-17625
> Project: Apache Arrow
> Issue Type: Bug
> Components: Parquet, Python
> Affects Versions: 9.0.0
> Reporter: Yishai Beeri
> Priority: Major
> Labels: Parquet, categorical
--
This message was sent by Atlassian Jira
(v8.20.10#820010)