Posted to issues@arrow.apache.org by "Chris Roat (Jira)" <ji...@apache.org> on 2021/01/07 01:25:00 UTC
[jira] [Created] (ARROW-11157) Consistent handling of categoricals
Chris Roat created ARROW-11157:
----------------------------------
Summary: Consistent handling of categoricals
Key: ARROW-11157
URL: https://issues.apache.org/jira/browse/ARROW-11157
Project: Apache Arrow
Issue Type: Improvement
Components: Python
Affects Versions: 2.0.0
Reporter: Chris Roat
What is the current state of categoricals with pyarrow? The `categories` parameter mentioned in [this GitHub issue|https://github.com/apache/arrow/issues/1688] no longer seems to be accepted by `pd.read_parquet`. With the pyarrow reader, `int` categoricals do not survive a write/read round trip, while `str` categoricals do, except when the file was written by fastparquet.
Using pandas 1.1.5, pyarrow 2.0.0, and fastparquet 0.4.1, I see the following handling of categoricals:
{code:python}
import os
import pandas as pd

fname = '/tmp/tst'
data = {
    'int': pd.Series([0, 1] * 1000, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 1000, dtype=pd.CategoricalDtype(['foo', 'bar'])),
}
df = pd.DataFrame(data)

# Try every writer/reader engine pairing and compare dtypes after the round trip.
for write in ['fastparquet', 'pyarrow']:
    for read in ['fastparquet', 'pyarrow']:
        if os.path.exists(fname):
            os.remove(fname)
        df.to_parquet(fname, engine=write, compression=None)
        df_read = pd.read_parquet(fname, engine=read)
        print()
        print('write:', write, 'read:', read)
        for t in data.keys():
            print(t, df[t].dtype == df_read[t].dtype)
{code}
{noformat}
write: fastparquet read: fastparquet
int True
str True
write: fastparquet read: pyarrow
int False
str False
write: pyarrow read: fastparquet
int True
str True
write: pyarrow read: pyarrow
int False
str True
{noformat}
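The True/False table shows that a dtype changed but not what it changed to. A small pandas-only helper can report both sides. This is a sketch: `dtype_mismatches` is a hypothetical name, and the `degraded` frame is a hand-made stand-in for a frame read back from Parquet, not actual output from either engine.

```python
import pandas as pd


def dtype_mismatches(expected: pd.DataFrame, actual: pd.DataFrame) -> dict:
    """Map each column whose dtype changed to (expected dtype, actual dtype)."""
    return {
        col: (str(expected[col].dtype), str(actual[col].dtype))
        for col in expected.columns
        if expected[col].dtype != actual[col].dtype
    }


original = pd.DataFrame({
    'int': pd.Series([0, 1] * 3, dtype=pd.CategoricalDtype([0, 1])),
    'str': pd.Series(['foo', 'bar'] * 3, dtype=pd.CategoricalDtype(['foo', 'bar'])),
})
# Stand-in for a round-tripped frame in which the 'int' categorical was lost.
degraded = original.assign(int=original['int'].astype('int64'))

print(dtype_mismatches(original, degraded))
```

In the reproduction script, calling `dtype_mismatches(df, df_read)` inside the inner loop would show the resulting dtype for each failing engine combination.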