Posted to jira@arrow.apache.org by "Yishai Beeri (Jira)" <ji...@apache.org> on 2022/09/06 06:04:00 UTC

[jira] [Updated] (ARROW-17625) Cast error on roundtrip of categorical column to parquet and back

     [ https://issues.apache.org/jira/browse/ARROW-17625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yishai Beeri updated ARROW-17625:
---------------------------------
    Description: 
Writing a table to parquet and then reading it back fails if:
 # One of the columns is a dictionary (created from a pandas Categorical), *and*
 # The table's schema is passed to `read_table`

The read fails while attempting to cast int64 to dictionary (full stack trace below).

This seems related to ARROW-11157 - but even if the categorical type is lost when reading from parquet, the reader should not fail when the schema is passed explicitly.

Minimal example of failing code:

 
{code:python}
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds
a = [1,2,3,4,1,2,3,4,1,2,3,4]
b = ["a" for i in a]
c = [i for i in range(len(a))]
df = pd.DataFrame({"a":a, "b":b, "c":c})
df['a'] = df['a'].astype('category')  # 'a' becomes a dictionary (categorical) column in Arrow
print("df dtypes:\n", df.dtypes)
t = pa.Table.from_pandas(df, preserve_index=True)
s = t.schema
ds.write_dataset(t, format='parquet', base_dir='./test')
df2 = pq.read_table('./test', schema=s).to_pandas()  # fails: unsupported cast from int64 to dictionary
print("df2 dtypes:\n", df2.dtypes)
{code}
 

 

Which gives: 

 
{code:none}
df dtypes:
 a    category
b      object
c       int64
dtype: object
Traceback (most recent call last):
  File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
    df2 = pq.read_table('./test', schema=s).to_pandas()
  File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/_init_.py", line 2827, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/_init_.py", line 2473, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary
{code}
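
A possible workaround, assuming the goal is simply to get the categorical column back in pandas: omit the schema= argument when reading and re-apply the categorical dtype afterwards. This is only a sketch around the problem, not a fix for the failing cast itself:

{code:python}
import pyarrow.parquet as pq

# Reading without the original schema avoids the failing int64 -> dictionary cast.
df2 = pq.read_table('./test').to_pandas()

# If the categorical dtype is lost on the roundtrip (see ARROW-11157),
# restore it manually in pandas.
df2['a'] = df2['a'].astype('category')
print(df2.dtypes)
{code}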

  was:
Writing a table to parquet and then reading it back fails if:
 # One of the columns is a dictionary (created from a pandas Categorical), *and*
 # The table's schema is passed to `read_table`

The read fails while attempting to cast int64 to dictionary (full stack trace below).

This seems related to ARROW-11157 - but even if the categorical type is lost when reading from parquet, the reader should not fail when the schema is passed explicitly.

Minimal example of failing code:

```

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
import pyarrow.dataset as ds

a = [1,2,3,4,1,2,3,4,1,2,3,4]
b = ["a" for i in a]
c = [i for i in range(len(a))]

df = pd.DataFrame({"a":a, "b":b, "c":c})
df['a'] = df['a'].astype('category')

print("df dtypes:\n", df.dtypes)

t = pa.Table.from_pandas(df, preserve_index=True)
s = t.schema

ds.write_dataset(t, format='parquet', base_dir='./test')

df2 = pq.read_table('./test', schema=s, use_pandas_metadata=True).to_pandas()

print("df2 dtypes:\n", df2.dtypes)

```

Which gives: 

```

df dtypes:
 a    category
b      object
c       int64
dtype: object
Traceback (most recent call last):
  File "/Users/yishai/lab/pyarrow_bug/reproduce.py", line 20, in <module>
    df2 = pq.read_table('./test', schema=s, use_pandas_metadata=True).to_pandas()
  File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2827, in read_table
    return dataset.read(columns=columns, use_threads=use_threads,
  File "/Users/yishai/lab/pyarrow_bug/venv/lib/python3.9/site-packages/pyarrow/parquet/__init__.py", line 2473, in read
    table = self._dataset.to_table(
  File "pyarrow/_dataset.pyx", line 331, in pyarrow._dataset.Dataset.to_table
  File "pyarrow/_dataset.pyx", line 2577, in pyarrow._dataset.Scanner.to_table
  File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
  File "pyarrow/error.pxi", line 121, in pyarrow.lib.check_status
pyarrow.lib.ArrowNotImplementedError: Unsupported cast from int64 to dictionary using function cast_dictionary

```

 


> Cast error on roundtrip of categorical column to parquet and back
> -----------------------------------------------------------------
>
>                 Key: ARROW-17625
>                 URL: https://issues.apache.org/jira/browse/ARROW-17625
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Parquet, Python
>    Affects Versions: 9.0.0
>            Reporter: Yishai Beeri
>            Priority: Major
>              Labels: Parquet, categorical
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)