Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/06/07 10:22:00 UTC
[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
[ https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858460#comment-16858460 ]
Joris Van den Bossche commented on ARROW-3801:
----------------------------------------------
[~buhrmann] do you know which version of pandas you were using?
For me, with either pandas master + arrow master or pandas 0.24.2 + arrow 0.12.1, this works fine (the categories of the reordered categorical get turned into a writable numpy array).
There have been improvements in pandas for dealing with read-only arrays in its hashtables, such as https://github.com/pandas-dev/pandas/pull/18825 and https://github.com/pandas-dev/pandas/pull/21688, so those might have fixed it.
> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> ------------------------------------------------------------------------
>
> Key: ARROW-3801
> URL: https://issues.apache.org/jira/browse/ARROW-3801
> Project: Apache Arrow
> Issue Type: Bug
> Components: C++, Python
> Affects Versions: 0.10.0
> Reporter: Thomas Buhrmann
> Priority: Major
> Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will make the categorical index non-writeable, which in turn trips up pandas when e.g. reordering the categories, raising "ValueError: buffer source array is read-only" :
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>
> Outputs:
>
> {code:java}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> <ipython-input-365-85b439586c1a> in <module>
> 12 print("DType after:", repr(df2.c1.dtype))
> 13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
> 15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>
>
>