Posted to issues@arrow.apache.org by "Joris Van den Bossche (JIRA)" <ji...@apache.org> on 2019/06/07 10:22:00 UTC

[jira] [Commented] (ARROW-3801) [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable

    [ https://issues.apache.org/jira/browse/ARROW-3801?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16858460#comment-16858460 ] 

Joris Van den Bossche commented on ARROW-3801:
----------------------------------------------

[~buhrmann] do you know which version of pandas you were using? 

With either pandas + arrow master or pandas 0.24.2 + arrow 0.12.1, this works fine for me (the categories of the reordered categorical end up as a writable numpy array).

There have been improvements in pandas to deal with read-only arrays related to hashtables, such as https://github.com/pandas-dev/pandas/pull/18825 and https://github.com/pandas-dev/pandas/pull/21688, so those might have fixed it.
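
For anyone stuck on an older pandas where the error still shows up, here is a minimal sketch of how a read-only array can be detected and swapped for a writable copy. It uses only NumPy and simulates the read-only categories array rather than going through an actual Arrow roundtrip. `ensure_writeable` is a hypothetical helper name used for illustration; it is not part of pandas or pyarrow.

```python
import numpy as np


def ensure_writeable(values):
    """Hypothetical helper: return the array itself if it is writable,
    otherwise return a writable copy."""
    arr = np.asarray(values)
    if arr.flags.writeable:
        return arr
    # A fresh copy owns its data and is always writable.
    return arr.copy()


# Simulate the read-only categories array reported in the issue.
categories = np.array([1, 2, 3])
categories.setflags(write=False)

fixed = ensure_writeable(categories)
print("Writeable before:", categories.flags.writeable)
print("Writeable after:", fixed.flags.writeable)
```

The copy costs one extra allocation of the categories array, which is usually small compared to the codes, so this should be cheap in practice.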

> [Python] Pandas-Arrow roundtrip makes pd categorical index not writeable
> ------------------------------------------------------------------------
>
>                 Key: ARROW-3801
>                 URL: https://issues.apache.org/jira/browse/ARROW-3801
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++, Python
>    Affects Versions: 0.10.0
>            Reporter: Thomas Buhrmann
>            Priority: Major
>             Fix For: 0.14.0
>
>
> Serializing and deserializing a pandas series with categorical dtype will make the categorical index non-writeable, which in turn trips up pandas when e.g. reordering the categories, raising "ValueError: buffer source array is read-only" :
> {code}
> import pandas as pd
> import pyarrow as pa
> df = pd.Series([1,2,3], dtype='category', name="c1").to_frame()
> print("DType before:", repr(df.c1.dtype))
> print("Writeable:", df.c1.cat.categories.values.flags.writeable)
> ro = df.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> tbl = pa.Table.from_pandas(df)
> df2 = tbl.to_pandas()
> print("DType after:", repr(df2.c1.dtype))
> print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ro = df2.c1.cat.reorder_categories([3,2,1])
> print("DType reordered:", repr(ro.dtype), "\n")
> {code}
>  
> Outputs:
>  
> {code}
> DType before: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: True
> DType reordered: CategoricalDtype(categories=[3, 2, 1], ordered=False)
> DType after: CategoricalDtype(categories=[1, 2, 3], ordered=False)
> Writeable: False
> ---------------------------------------------------------------------------
> ValueError Traceback (most recent call last)
> <ipython-input-365-85b439586c1a> in <module>
>  12 print("DType after:", repr(df2.c1.dtype))
>  13 print("Writeable:", df2.c1.cat.categories.values.flags.writeable)
> ---> 14 ro = df2.c1.cat.reorder_categories([3,2,1])
>  15 print("DType reordered:", repr(ro.dtype), "\n")
> {code}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)